PREFACE


This  volume  is  a  collection  of  papers  presented  at  the 
IEEE  Symposium  on  Speech  Recognition  at  Carnegie-Mellon 
University  April  15-19,  1974.  Taken  together  they  represent 
a  snapshot  of  the  BBN  speech  understanding  project  as  of 
approximately  December,  1973.  The  project  is  still  far  from 
complete  and  in  particular  is  not  yet  to  the  point  where 
definitive  conclusions  as  to  the  success  of  the  techniques 
described  can  be  reported.  However,  I  believe  that  the 
present  collection  of  papers  serves  a  useful  purpose  in 
documenting  the  state  of  the  project  and  the  approach  that 
we  are  taking. 


W.A.  Woods 
MOTIVATION AND OVERVIEW OF BBN SPEECHLIS:

AN EXPERIMENTAL PROTOTYPE FOR SPEECH UNDERSTANDING RESEARCH


W.A. Woods

Bolt Beranek and Newman Inc.

50 Moulton St., Cambridge, Ma. 02138


Introduction 


This paper describes a computer system
under development at Bolt Beranek and Newman
for carrying out research in continuous
speech understanding. The system is a
research prototype of an intelligent speech
understanding system which makes use of
advanced techniques of artificial
intelligence, natural language processing,
and acoustical and phonological analysis and
signal processing in an integrated way to
determine an interpretation of a continuous
speech utterance which is both syntactically
and semantically plausible and consistent
with the acoustic-phonetic analysis of the
input signal.


We take as a point of departure that the
information required to produce the correct
interpretation of an utterance is not
completely and unambiguously encoded into the
speech signal, but rather that knowledge of
the vocabulary and of syntactic, semantic,
and pragmatic constraints of the language is
used to compensate for uncertainties and
errors in the acoustic realization of the
utterance. This fact seems appropriately
substantiated by human perceptual performance
[10] and by Klatt and Stevens' spectrogram
reading experiments [2]. In the latter,
human experts attempting to decipher
spectrograms achieved error rates of
approximately 25% in "partial" phonetic
transcription based on spectrographic
evidence alone but were 96% successful in
identifying the words of the utterances when
permitted to make use of knowledge of the
vocabulary and of syntactic and semantic
constraints. It is the matching of human
performance in these experiments towards
which the BBN speech understanding system
(dubbed SPEECHLIS) aspires.

In a previous paper [12] we have
described a method of "incremental
simulation" which we have used to get a
feeling for the types of interaction among
the different sources of knowledge used
during the understanding of a speech signal.
In that article, we postulated the
decomposition of a speech understanding
system into separate components and presented
an illustrative example of their interaction
in the analysis of an utterance. We also
discussed the types of inference capabilities
which would be required from the different
components in a mechanical speech
understanding system. In this paper we will
describe how we have attempted to embody
those capabilities in SPEECHLIS.
Domain of Discourse

If one is to use knowledge of
vocabulary, syntax, and semantics in a speech
understanding system, it is necessary to
select what vocabulary, syntax, and semantics
to deal with. For our initial domain,
because of its ready availability and its
sophisticated syntax and semantics, we
selected the domain of the LUNAR system
[11,13], a natural English question-answering
system dealing with chemical analyses of the
Apollo 11 moon rocks. The LUNAR system
understands and answers such questions as:

"What is the average concentration of
rubidium in high-alkali rocks?"

"List potassium/rubidium ratios for
samples not containing silicon."

"How many rocks contain greater than 10%
plagioclase?"

It contains a vocabulary of approximately
3500 words and a grammar for an extensive
subset of general English. For our initial
speech system, we have selected a subset of
approximately 250 words from LUNAR's
vocabulary and a subgrammar of more
restricted English from its grammar. In the
future we intend to increase our vocabulary
to over 1000 words, extend our grammar to
include the entire LUNAR grammar, and include
several additional domains of discourse
unrelated to lunar geology.


Knowledge  gathering 


In order to gain a concrete
understanding of the types of interaction
required in using higher level linguistic
knowledge to augment the front end (acoustic)
analysis of the speech signal, we used
"incremental simulations" to begin
experimenting with the speech understanding
system before completing its construction by
"implementing" its components as combinations
of computer program and human simulators.
From these simulations, the following general
conclusions were reached:


1) Small function words such as "a", "of",
"the", etc., which are generally
unstressed and short, have a high
probability of matching accidentally in
the signal. They are therefore unreliable
cues by themselves on which to make a
decision about an utterance and are
unprofitable to look for on a "bottom up"
or analytical scan of the utterance.
However, when the hypothesized "content
words" of the utterance are being parsed
according to a grammar of English,
syntactic knowledge is able to predict
those places where such function words
might occur, and, in many cases, further
semantic information is capable of
predicting which function words are
likely.

2) It is not generally possible with the
current estimated level of performance of
the acoustic analyzer to distinguish
correct from incorrect word matches by
acoustic word match scores alone. When a
threshold of acoustic match quality is set
sufficiently low to accept a high
proportion of the correct word matches, a
large number of accidental matches of
other words are also accepted. The ratio
of extraneous matches to correct ones
depends on the setting of the threshold
(as the threshold is relaxed the ratio
gets higher), but for reasonable settings
it may be on the order of 20 to 1.
Moreover, it appears to be impossible to
set the threshold sufficiently low to
guarantee acceptance of all correct word
matches without swamping the system with
extraneous accidental matches. However, in
human simulations, although it required
considerable thrashing around in difficult
cases, it was generally possible to go
back to selected regions of the utterance
after partial lexical, semantic, and
syntactic analysis and perform additional
phonological and phonetic analysis and/or
word matching to obtain the correct words.
Although we are attempting to provide such
processes in our system, they are likely
to be more combinatoric in their searching
for possibilities than the human
simulation. It is far too early to
predict the success of their performance.

3) The process of inferring an interpretation
from a speech signal is inherently
non-deterministic in the sense that it is
frequently not possible to make a
particular decision (such as which of
several matching words is the correct one
at a given position) without making an
assumption and following out its
consequences for the rest of the
interpretation. Mechanisms must be
provided for following out all of the
alternative choices in order to find the
correct interpretation.

4) No a priori order of scanning the
utterance (such as left-to-right) for word
matches and syntactic and semantic
processing will be adequate in general
since any given word may be garbled in its
pronunciation or phonetic analysis and we
may depend on the successful analysis of
the rest of the utterance to recover the
garbled word. Hence classical
left-to-right parsers will not suffice,
nor will semantic interpretation rules
such as those in LUNAR which are indexed
solely under the head of the construction
being interpreted (the head of the
construction may be the word that is
garbled and we may need to find the
successful match of the rest of the rule
in order to infer the garbled word).

5) The space of possible alternative
computation paths which could lead to an
interpretation of a signal is too vast to
be searched in its entirety. In fact the
set of possibilities which could be tried
to get an interpretation when one has not
found one yet is open-ended. Examples
include relaxing the threshold of
acceptability for word matches in the
utterance (or in portions of it), trying
the next best acoustical analysis of a
given segment or combination of them,
looking for possible alternative ways to
segment the utterance into phoneme
sequences, deciding to accept an
interpretation of the utterance even
though it is not syntactically
well-formed, or deciding to accept an
interpretation which is not semantically
meaningful. (I heard what you said but it
doesn't make sense.) Because of the
openendedness of this search space, it is
essential to devise strategies for
searching it which devote their effort to
the regions of the space most likely to
yield the best interpretation and work out
from there toward less and less likely
interpretations. This requires the use of
decision criteria to evaluate the goodness
of a word match, and weigh the
alternatives of a more grammatical
interpretation with poorer word matches
against a sequence of better word matches
which doesn't parse or doesn't make sense.
It is critical to know the difference
between reliable and unreliable clues and
to juggle competing alternative partial
interpretations so as to continually
devote effort to the best ones.

6) Even with strategies for selectively
pursuing alternatives according to their
likelihood of success, the combinatorics
of the situation are such that the system
will be swamped with alternative
possibilities unless special techniques
are used to keep potentially different
alternatives merged for processing
operations for which they behave
identically, splitting them up only when
an operation being executed has a
different effect for the different
alternatives. One must avoid prematurely
multiplying combinations of cases. For
example, one cannot afford to multiply out
all of the possible sequences of phonemes
which could cover the utterance.

The system which we have been developing
has been designed to meet these requirements.
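
Conclusion 2 above can be illustrated with a small numerical sketch. The word matches, their scores, and the function name accepted_counts below are invented for illustration and are not taken from the SPEECHLIS experiments; the sketch only shows the mechanism by which relaxing the threshold admits extraneous matches faster than correct ones.

```python
# Hypothetical word-match records: (word, acoustic score, is it a correct match?).
# All scores are invented for illustration.
MATCHES = [
    ("rubidium", 0.92, True),
    ("rocks", 0.81, True),
    ("ratio", 0.78, False),
    ("contain", 0.66, True),
    ("rock", 0.64, False),
    ("of", 0.60, False),
    ("a", 0.58, False),
    ("the", 0.55, False),
    ("samples", 0.41, True),
    ("in", 0.40, False),
]


def accepted_counts(matches, threshold):
    """Return (correct, extraneous) counts among matches scoring >= threshold."""
    correct = sum(1 for _, score, ok in matches if score >= threshold and ok)
    extraneous = sum(1 for _, score, ok in matches if score >= threshold and not ok)
    return correct, extraneous


# Relax the threshold step by step and watch the extraneous matches grow.
for threshold in (0.9, 0.6, 0.4):
    correct, extraneous = accepted_counts(MATCHES, threshold)
    print(f"threshold={threshold:.1f} correct={correct} extraneous={extraneous}")
```

In this toy data the last correct word ("samples") is only recovered at a threshold that also admits every accidental match, which is the situation the conclusion describes.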

Components  of  the  System 
Principal  Knowledge  Components 

As  a  consequence  of  examining  the 
protocols  and  results  of  the  Klatt  and 
Stevens  experiments  it  was  apparent  that 
their  performance  was  based  on  the 
capabilities  of  at  least  6  conceptually 
distinguishable  components 

1) an acoustic feature extraction component
which performs the equivalent of a
first-pass segmentation and labeling of
the acoustic signal into partial phonetic
descriptions, probably taking into account
knowledge of phonological rules.

2) a lexical retrieval program which, on the
basis of knowledge of the vocabulary and
partial phonetic descriptions, retrieves
words from the lexicon to be matched
against the input signal.

3)  a  word  verification  component  which,  given 
a  particular  word  and  a  particular 
location  in  the  input  signal,  determines 
the  degree  to  which  the  word  matches  the 
signal. 

4) a syntactic component which is capable of
judging grammaticality of an hypothesized
interpretation of the signal and of
proposing words or syntactic categories to
extend a partial interpretation.

5)  a  semantic  component  which  is  capable  of 
noticing  coincidences  between  semantically 
related  words  which  have  been  found  at 
different  places  in  the  signal,  judging 
the  meaningfulness  of  an  hypothesized 
interpretation,  and  predicting  particular 
words  or  specific  classes  of  words  for 
extending  a  partial  interpretation. 

6) a pragmatic component, which is capable of
making judgments and predictions as to the
pragmatic likelihood of a given sentence
being uttered by the speaker, taking into
account whatever is known about the
speaker and the situation.

In addition to these 6 components which
correspond to some extent to different
sources of knowledge that go into the
determination of the preferred
interpretation, there is clearly an
additional component of a different sort --
namely the decision process itself. In this
component, which we have called the control
component, reside the strategies for inferring
an interpretation of the utterance, dealing
with questions such as:

where should one look for word matches first?

how much partial phonetic information is
given as input to the lexical retrieval
routine?

how good a word match score is required for
the word to be given further
consideration?

how and at what points does one use syntactic
and semantic information to influence the
interpretation?

how are alternative possible interpretations
formed, managed, and resolved?

when should one temporarily abandon a given
region of the utterance to concentrate on
another region?

what information might be found elsewhere
that might help, and how can it be used?

These and myriad other questions have answers
(not necessarily optimal) embedded in the
procedures used by the human experts to
interpret the spectrograms in the Klatt and
Stevens experiments. We need to capture
similar strategies in the control component
of our speech understanding system.

The  Control  Component 

Clearly the strategies embedded in the
control component, critical to the success of
the system, are far from obvious. We have
attempted to arrive at a reasonable set of
such strategies by drawing on intuitions
developed in incremental simulations. These
strategies are being continually refined and
extended as we gain more experience with the
evolving SPEECHLIS.

The function of the control component
centers around the creation, refinement, and
evaluation of formal data objects called
"theories", which represent alternative
hypotheses about the utterance being
interpreted. A theory contains the words
hypothesized to be in the utterance and where
they match, semantic hypotheses about how
those words relate to each other, hypotheses
about syntactic structure, and various scores
reflecting the "likelihood" of the theory
from different points of view (lexical match
quality, semantic completeness, syntactic
correctness, etc.). These theories generally
represent only partial hypotheses, beginning
with single word theories with little or no
syntactic or semantic detail, constructing
larger theories by refinement, and eventually
building up to complete theories representing
hypotheses for a sequence of words covering
the entire utterance with complete syntactic
structure and semantic interpretation. The
task of the control component is to manage
the creation and refinement of those
theories, devoting its resources to expanding
those theories which look best according to
their various scores until one or more
complete theories with acceptable scores are
found. Control passes partial theories at
various times to the syntactic and semantic
components, which return them with evaluation
scores or suspend them, after creating
monitors for events (which could cause the
refinement of a theory) and making proposals
for word matches (which Control should recall
the word matcher to look for). Monitors
behave as active "demons" to give notices to
control whenever events of the type which
they are looking for occur. Each monitor
remembers the theory which set it and a
procedure which is to be executed to
assimilate the event that triggers the
monitor. The result of executing this
procedure will be a new refined theory which
may itself set additional monitors and make
proposals.
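
The relationship between theories and monitors described above can be sketched as a pair of small data structures. This is a minimal sketch under stated assumptions, not the SPEECHLIS implementation: the class names Theory and Monitor, the tuple layout of a word match, the averaging used for the lexical score, and the helper names extend and notify are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A word match is assumed to be (word, start_boundary, end_boundary, lexical_score).
WordMatch = Tuple[str, int, int, float]


@dataclass
class Theory:
    """A partial hypothesis: the words assumed present, plus assorted scores."""
    words: List[WordMatch]
    scores: dict = field(default_factory=dict)


@dataclass
class Monitor:
    """A 'demon' that remembers who set it and how to assimilate an event."""
    owner: Theory                      # the theory which set the monitor
    wanted: str                        # event watched for; here simply a word
    assimilate: Callable[[Theory, WordMatch], Theory]


def extend(theory, match):
    """Build a new, refined theory that also contains the noticed word match."""
    words = sorted(theory.words + [match], key=lambda m: m[1])
    lexical = sum(m[3] for m in words) / len(words)  # assumed scoring rule
    return Theory(words=words, scores={"lexical": lexical})


def notify(monitors, match):
    """Fire every monitor watching for this word; return the refined theories."""
    return [m.assimilate(m.owner, match) for m in monitors if m.wanted == match[0]]


seed = Theory(words=[("rubidium", 5, 9, 0.9)], scores={"lexical": 0.9})
monitors = [Monitor(owner=seed, wanted="concentration", assimilate=extend)]
refined = notify(monitors, ("concentration", 1, 4, 0.7))
print([w[0] for w in refined[0].words])
```

The original theory is left untouched when a monitor fires, so the unrefined and refined theories can both remain candidates for further work.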

In the next few sections, we will
describe in a little more detail the various
components of the current system. The scope
of the current paper, however, will
necessarily require these descriptions to be
brief. For more detailed descriptions of the
individual components the reader is again
referred to the individual papers in this
volume [1,6,7,8,9].

Acoustic-Phonetic and Phonological Analysis

In the acoustic end of our system, the
speech signal is sampled at 20 kHz and stored
on a disc file. All subsequent analysis is
performed on the digitized signal. Using our
recently developed method of "selective
linear prediction" [3,4] we perform a linear
predictive (LP) analysis on the 0-5 kHz
region of the spectrum. Presently, almost
all our parameters are based on that portion
of the spectrum, the exception being a
parameter giving the spectral energy between
5-10 kHz, which is used for detection of
frication. The parameters used in our
segmentation and feature extraction are based
on: energy of the signal, energy of the
differenced signal, low-frequency energy, the
first autocorrelation coefficient, the
normalized LP error, energy-sensitive and
energy-insensitive spectral derivatives,
fundamental frequency, frequencies of a
two-pole LP model [5] and poles of a 14-pole
LP model. We have developed an initial set
of algorithms for the nondeterministic
segmentation of the utterance into a feature
or segment lattice. Associated with each
segment boundary are confidence measures that
reflect the likelihoods of that point in the
utterance being a segment boundary and of it
being a word boundary. Another set of
algorithms performs a feature analysis on
each of the segments. We have concentrated
thus far on the recognition of manner of
articulation, e.g. vowel, nasal, lateral,
retroflexed, plosive, fricative,
voiced/unvoiced. The only place of
articulation recognition that we do is
performed on the vowels and strident
fricatives. Confidence estimates for each of
the features and for the entire segment are
also given.
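
Three of the parameters listed above (signal energy, energy of the differenced signal, and the first autocorrelation coefficient) have simple textbook definitions, sketched below for a single analysis frame. The frame length, the test tones, and the function name frame_parameters are assumptions made for this illustration; the text does not say how SPEECHLIS frames the signal or combines these parameters.

```python
import math


def frame_parameters(frame):
    """Return (energy, differenced-signal energy, first autocorrelation coefficient)."""
    energy = sum(x * x for x in frame)
    # The differenced signal emphasizes high frequencies.
    diff = [b - a for a, b in zip(frame, frame[1:])]
    diff_energy = sum(d * d for d in diff)
    # Lag-1 autocorrelation, normalized by the lag-0 term (the energy).
    r1 = sum(a * b for a, b in zip(frame, frame[1:]))
    rho1 = r1 / energy if energy > 0 else 0.0
    return energy, diff_energy, rho1


RATE = 20000  # sampling rate in Hz, as stated in the text
N = 200       # assumed frame length: 10 ms at 20 kHz

# Two synthetic frames: a low tone (voiced-like) and a high tone (frication-like).
low = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(N)]
high = [math.sin(2 * math.pi * 8000 * n / RATE) for n in range(N)]

for name, frame in (("low", low), ("high", high)):
    energy, diff_energy, rho1 = frame_parameters(frame)
    print(f"{name}: energy={energy:.1f} diff_energy={diff_energy:.1f} rho1={rho1:.3f}")
```

A frame dominated by low frequencies gives a first autocorrelation coefficient near 1 and little differenced-signal energy, while a frame dominated by high frequencies gives a small or negative coefficient and much more differenced-signal energy, which is why such parameters are useful cues for manner of articulation.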


The output of the acoustic-phonetic
analysis is in the form of a segment lattice,
an example of which is illustrated in
Figure 1. It compactly represents all of the
possible alternative segmentations of the
utterance and the alternative identities of
the individual segments. This lattice is
processed by a phonological rule component
which augments the lattice with segments for
possible underlying sequences of phonemes
which could have resulted in the observed
acoustic sequences. We associate with each
added branch a predicate function which is
later used by the word matcher to check for
the applicability of the given phonological
rule based on the specific word spelling and
the necessary context. In this manner, the
phonological rules are both analytic and
partially generative. Other generative rules
can be applied ahead of time to the
dictionary phonemic spellings of words;
such rules have been done manually in our
current system.
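
To see why a lattice is a compact representation, consider the following sketch, which counts the segmentations and phoneme sequences a small lattice encodes without ever enumerating them. The edge format, the example labels, and the function name count_paths are assumptions for illustration; the actual lattice of Figure 1 also carries confidence measures and rule predicates that are omitted here.

```python
from collections import defaultdict

# Each edge is (start_boundary, end_boundary, candidate labels). Boundaries are
# numbered in time order, so every edge runs forward. The labels are invented.
LATTICE = [
    (0, 1, ("s", "z")),
    (1, 2, ("ae",)),
    (2, 3, ("m", "n")),
    (1, 3, ("ae",)),      # alternative segmentation: one segment spanning 1..3
    (3, 4, ("p", "b")),
]


def count_paths(lattice, start, end):
    """Count (segmentations, phoneme sequences) from start to end.

    Works by dynamic programming over boundaries, so the alternatives are
    counted without being multiplied out.
    """
    segmentations = defaultdict(int)
    sequences = defaultdict(int)
    segmentations[start] = 1
    sequences[start] = 1
    # Processing edges by start boundary ensures all ways of reaching a
    # boundary are known before any edge leaving it is used.
    for a, b, labels in sorted(lattice, key=lambda e: (e[0], e[1])):
        segmentations[b] += segmentations[a]
        sequences[b] += sequences[a] * len(labels)
    return segmentations[end], sequences[end]


print(count_paths(LATTICE, 0, 4))
```

Five edges here stand for 2 segmentations and 12 distinct phoneme sequences; for a full utterance the explicit list of sequences grows multiplicatively while the lattice grows only with the number of segments.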

Higher  Level  Linguistic  Constraints 

The current lexical retrieval and word
matching component makes use of a phonetic
similarity matrix for evaluating non-exact
phoneme matches, phonologically motivated
deletion likelihoods for each of the phonemes
in a word, and rudimentary duration cues
based on stress marks in the phonemic
spelling of the word. Words with three or
more phonemes which score above a threshold
of match quality are placed in a "word
lattice," an example of which is illustrated
in Figure 2. They are given individually to
the semantic component which constructs a
one-word theory for each content word,
monitors for words that could be semantically
related to the given one, and generates
events for each detected coincidence between
two or more semantically related words or
concepts. Each word is also checked for
matching inflectional endings, and verbs are
checked for possible auxiliaries to their
left and at the beginning of the utterance.
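
One way to combine a phonetic similarity matrix with per-phoneme deletion likelihoods is a small dynamic-programming alignment, sketched below. This is an illustrative assumption, not the SPEECHLIS word matcher: the multiplicative scoring, the similarity values, the deletion likelihoods, and the names similarity and match_score are all invented, and the duration cues mentioned above are left out.

```python
def similarity(phoneme, segment, table):
    """Look up how well a dictionary phoneme matches an observed segment label."""
    if phoneme == segment:
        return 1.0
    return table.get((phoneme, segment), table.get((segment, phoneme), 0.0))


def match_score(phonemes, deletion, segments, table):
    """Best alignment score of a word's phonemes against a run of segments.

    Each phoneme is either matched to the next segment (scored by similarity)
    or deleted (scored by its deletion likelihood). Every segment must be
    accounted for by some phoneme. Scores multiply along the alignment.
    """
    rows, cols = len(phonemes) + 1, len(segments) + 1
    best = [[0.0] * cols for _ in range(rows)]
    best[0][0] = 1.0
    for i in range(1, rows):
        for j in range(cols):
            deleted = best[i - 1][j] * deletion[i - 1]
            matched = 0.0
            if j > 0:
                matched = best[i - 1][j - 1] * similarity(
                    phonemes[i - 1], segments[j - 1], table
                )
            best[i][j] = max(deleted, matched)
    return best[-1][-1]


# Invented values: [s]/[z] are acoustically close; a final [t] is often dropped.
TABLE = {("s", "z"): 0.6, ("ih", "eh"): 0.5}
LIST_PHONEMES = ["l", "ih", "s", "t"]
LIST_DELETION = [0.0, 0.0, 0.0, 0.3]

print(match_score(LIST_PHONEMES, LIST_DELETION, ["l", "ih", "s", "t"], TABLE))
print(match_score(LIST_PHONEMES, LIST_DELETION, ["l", "ih", "z"], TABLE))
```

A score of this kind would then be compared against the match-quality threshold before the word is entered in the word lattice.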

The semantic coincidence events are
sorted by the control component in order of
their likelihood scores and at appropriate
times are returned to Semantics for the
construction of larger theories. In this
way, multiple word theories are constructed
which consist of semantically related content
words which match well acoustically. When a
theory becomes maximal (i.e., Semantics has
no further words to add to it), it is passed
to Syntax for syntactic evaluation. In
addition to evaluation, Syntax picks up
further words from the word lattice and
proposes words (especially function words) to
fill the gaps between the words originally
provided in the theory. Syntax also monitors
for syntactic categories of words which it
could use to fill gaps. When Syntax
completes a constituent (such as a noun
phrase) it calls Semantics directly to verify
the consistency between the syntactic
structure of the constituent and the semantic
hypotheses for its words.

The control strategy maintains a list of
active theories, pending events, and proposed
words and classes -- all ordered by estimates
of likelihood -- and determines which
theory/event/proposal to work on next at each
point.
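
A likelihood-ordered work list of this kind is commonly realized as a priority queue. The sketch below is one simple reading of the paragraph above, not the SPEECHLIS control strategy itself: it merges theories, events, and proposals into a single queue, and the class name Agenda, its method names, and the example likelihoods are all hypothetical.

```python
import heapq
import itertools


class Agenda:
    """Holds theories, events, and proposals; always yields the likeliest next."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # keeps insertion order among equal scores

    def add(self, kind, item, likelihood):
        # heapq is a min-heap, so the likelihood is negated to pop the best first.
        heapq.heappush(self._heap, (-likelihood, next(self._tie), kind, item))

    def pop_best(self):
        neg_likelihood, _, kind, item = heapq.heappop(self._heap)
        return kind, item, -neg_likelihood

    def __len__(self):
        return len(self._heap)


agenda = Agenda()
agenda.add("theory", "T1: rubidium ... rocks", 0.6)
agenda.add("event", "E1: 'concentration' noticed for T1", 0.9)
agenda.add("proposal", "P1: look for 'of' at boundary 4", 0.3)

order = []
while agenda:
    kind, item, likelihood = agenda.pop_best()
    order.append(kind)
    print(kind, likelihood, item)
```

Because new theories, events, and proposals can be added while work is in progress, effort is continually redirected to whatever currently looks most promising.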

Some pragmatic inferences have been
identified and embedded in the control
strategy, but no systematic pragmatics
component has been incorporated. The
construction of semantic procedures for
answering questions using the data base has
not yet been implemented since we have
previously done this once with the LUNAR
system and have been devoting our effort
instead to the new aspects of the system.

Preliminary  results  obtained 

Since the current phase of the BBN
speech project is more concerned with finding
the problem areas and developing possible
solution techniques, it is premature to
expect statistical results such as percentage
of utterances successfully understood.
Rather, the principal product of the research
at this point consists of experiences that
suggest experiments yet to be done and
techniques whose effectiveness has yet to be
fully measured. The following are some
examples:

The inclusion in the word matching function
of simple duration checks for stressed
phonemes and of deletion probabilities for
each phoneme decreased the scores of many
of the accidental word matches without
effectively lowering the scores of the
correct word matches. This suggests a
host of experiments -- how much
improvement can you obtain? -- with what
cost?

The ambiguities of segmentation and
labeling of the acoustic signal can result
in the same word matching the input signal
in approximately the same place in several
different ways with slightly different end
points and slightly different scores.
From the point of view of the semantic
associations invoked, these word matches
are all the same and should not be dealt
with by separate theories, one for each
such match. This has resulted in the
creation of a "fuzzy word match" which
lumps together equivalent word matches
into a single entity which is dealt with
by Semantics as a single word match with
ambiguous end points. This greatly
reduces the number of theories processed.

A similar phenomenon occurs when several
words from a single semantic class all
match the signal at the same point (for
example the pronouns "I", "we", and "us").
Again, since Semantics will initially do
the same thing for each such word, these
are grouped together into a "clump" which
is treated as a single word until such
time as later processing splits it up.


Certain acoustic-phonetic facts which are
not currently dealt with by the segmenting
and labeling component can cause
recognizable pathologies at later stages
of processing. For example, the fact that
voicing frequently drops out before the
end of frication in a voiced fricative
followed by an unvoiced plosive may cause
the segmenter to recognize a segment
sequence [z][k] as a sequence [z][s][k]
causing word matches for "samples" and
"contain" which should be adjacent to have
a spurious [s] segment between them. This
problem could be dealt with either by
improving the initial segmentation and
labeling algorithm, or by an analytic
phonological rule to combine the voiced
and unvoiced fricative in this context
into a single voiced fricative, or by a
higher level word adjacency test which
considers two words to be adjacent if a
spurious segment between them can be
accounted for as an expected transition
segment. This suggests experiments to be
performed when the system is more fully
developed to determine the most effective
place to deal with this and similar
problems.

It is possible to get alternative
interpretations with almost equally good
lexical, syntactic, and semantic
evaluations -- even two interpretations
with exactly the opposite meaning. In all
such situations which we have witnessed,
there has been other information (such as
prosodic or pragmatic information)
available to make a choice, but it seems
clear that the information which could be
so used is open ended, and it is not clear
how much is required in order to get
acceptable performance even for our 250
word vocabulary, much less a 1000 word
vocabulary.

The list of such questions which are
being raised could go on and on, and would
not be worth enumerating. However, the above
list should be suggestive of the types of
results which we hope to obtain.
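
The "fuzzy word match" described among the examples above amounts to grouping matches of the same word whose end points nearly coincide. The sketch below shows one way this grouping could be done; the tolerance of one boundary, the record layout, and the function name fuzzy_word_matches are assumptions for illustration and are not taken from the system.

```python
def fuzzy_word_matches(matches, tolerance=1):
    """Lump together matches of one word whose end points nearly coincide.

    matches: iterable of (word, start_boundary, end_boundary, score).
    Returns a list of groups, each with the word, the sets of alternative
    start and end boundaries, and the best score among its members.
    """
    groups = []
    for word, start, end, score in sorted(matches):
        for group in groups:
            near_start = any(abs(start - s) <= tolerance for s in group["starts"])
            near_end = any(abs(end - e) <= tolerance for e in group["ends"])
            if group["word"] == word and near_start and near_end:
                group["starts"].add(start)
                group["ends"].add(end)
                group["best"] = max(group["best"], score)
                break
        else:
            # No existing group fits, so this match starts a new one.
            groups.append(
                {"word": word, "starts": {start}, "ends": {end}, "best": score}
            )
    return groups


# Invented matches: three near-identical hits for "samples", one far away.
MATCHES = [
    ("samples", 3, 7, 0.80),
    ("samples", 3, 8, 0.75),
    ("samples", 4, 8, 0.70),
    ("contain", 8, 11, 0.90),
    ("samples", 12, 16, 0.60),
]

for group in fuzzy_word_matches(MATCHES):
    print(group["word"], sorted(group["starts"]), sorted(group["ends"]), group["best"])
```

Five raw matches collapse into three entities here, so Semantics would build three one-word theories rather than five.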

A  Sample  of  Current  Performance 
Issues  of  evaluation 

We have outlined the methodology and the
current state of a project to develop
advanced speech understanding capabilities by
a process of continual incremental
improvement of a system with initially crude
capabilities in each of its individual
components. An important consideration for
such a program is a method for evaluating the
progress of this evolutionary development in
terms of the performance of the system or of
its parts. How does one measure the
improvement (or degradation) in system
performance caused by a particular change to
a strategy in one of the components? Although
our current system has not yet reached the
stage where we are prepared to run many
utterances through it to compute statistics
of performance, we have given some thought to
what statistics of performance one would like
to see and have made some initial
measurements of them on test sentences.
Evaluation parameters fall into two
classes, measures of precision and measures
of accuracy. For example, in evaluating the
performance of the segment labeler, precision
measures the degree to which the label
assigned uniquely specifies the phonemic
identity of the segment, while accuracy
measures the frequency with which the
description is correct. There is clearly a
tradeoff between those two measurements since
one can achieve perfect accuracy by relaxing
precision to the point where the description
assigned is sufficiently vague to include all
of the phonemes. On the other hand, one
could only achieve perfect precision by
choosing at every point the single most
likely phoneme with a subsequent loss of
accuracy. There are similar measures of
precision and accuracy for the process of
segmentation itself (as opposed to labeling)
and the process of lexical retrieval and
matching.

As a measure of precision in
segmentation, we may take the branching ratio
of the segment lattice, i.e. the number of
segments per boundary. Accuracy in
segmentation falls into two categories -- the
number of missing boundaries (i.e. segment
boundaries which were not identified as
potential boundaries in the lattice) and the
number of extra boundaries (i.e. points in
the utterance identified as boundaries in the
lattice which were not segment boundaries and
for which there is no "bridging" segment
crossing that region of the utterance).

Specific precision and accuracy measures
for segment labeling are the average number
of phonemes per label (i.e. the number of
phonemes subsumed under the description
assigned to a segment) and the average
percentage of errors in labeling (when the
correct phoneme is not subsumed in the
assigned description).
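
As a small worked illustration of these two labeling measures, the
sketch below assumes each label is a set of candidate phonemes and that
a reference phoneme is known for every segment. The function name and
the example labels are hypothetical.

```python
def labeling_measures(labels, correct):
    """labels: list of sets of phonemes; correct: the true phoneme per segment."""
    # Precision: average number of phonemes subsumed under a label.
    average_label_size = sum(len(label) for label in labels) / len(labels)
    # Accuracy: percentage of segments whose label omits the true phoneme.
    errors = sum(1 for label, truth in zip(labels, correct) if truth not in label)
    error_percentage = 100.0 * errors / len(labels)
    return average_label_size, error_percentage


example_labels = [{"p", "t", "k"}, {"a"}, {"m", "n"}, {"s", "z"}]
example_correct = ["t", "a", "l", "s"]
avg_size, error_pct = labeling_measures(example_labels, example_correct)
```

In this example the labels hold 8 phonemes over 4 segments (an average
of 2.0), and one label out of four misses the true phoneme (25 percent).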

At the lexical level, we can measure the
success of the initial lexical retrieval pass
in terms of the number of correct words found
(out of the total number of correct words to
be found -- an accuracy measure) and the
"stray word ratio" (the ratio of the total
number of words found to the number of
correct words found -- a precision measure).
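
The lexical measures, and their dependence on the acceptance threshold
for word match quality, can be illustrated as follows. This is a sketch
under assumptions: the word match scores, the 0-100 score scale, and the
function name are all hypothetical and are not taken from the
experiments reported here.

```python
def lexical_measures(matches, total_correct, threshold):
    """matches: (word, score, is_correct) tuples from the retrieval pass."""
    accepted = [m for m in matches if m[1] >= threshold]
    correct_found = sum(1 for m in accepted if m[2])
    # Accuracy: correct words found out of those that should be found.
    fraction_found = correct_found / total_correct
    # Precision: total words found per correct word found.
    if correct_found:
        stray_word_ratio = len(accepted) / correct_found
    else:
        stray_word_ratio = float("inf")
    return correct_found, fraction_found, stray_word_ratio


# Hypothetical scores for a four-word utterance.
MATCHES = [
    ("samples", 90, True), ("contain", 80, True), ("silicon", 60, True),
    ("occur", 70, False), ("with", 50, False), ("content", 40, False),
]
strict = lexical_measures(MATCHES, total_correct=4, threshold=65)
loose = lexical_measures(MATCHES, total_correct=4, threshold=45)
```

Lowering the threshold from 65 to 45 finds one more correct word but
also admits one more stray word, with no change to the matching
procedure itself.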


Clearly there are precision/accuracy
tradeoffs throughout the system. By merely
adjusting the threshold of acceptable word
match quality, the number of correct words
found and the stray word ratio can be altered
without any change at all in the algorithm
being used for word matching.

While we have not performed the
necessary experiments to be able to give any
conclusions about the behavior of these
parameters as a function of differences in
strategies, threshold levels, etc., and while
the current components give only crude
approximations to the performance which we
expect, we have conducted a few tests which
may serve as benchmarks. Figure 3 gives the
results of some tests (made in October, 1973)
on two utterances using three different
acoustic analysis methods to produce the
segment lattices. The first case (manual) is
the result of a human spectrogram reading as
in the first phase of the Klatt and Stevens
experiments. The second case (auto1) is the
result of our first crude segmenting and
labeling program which estimates only the
manner of articulation of the segments and
does not measure place of articulation. The
third case (auto2) makes use of a slightly
improved version (but still crude) of the
segmenting and labeling program, which tracks
formants and estimates place of articulation
for vowels. At the bottom of Figure 3 is
shown the word match score assigned by the
lexical retrieval component to each of the
correct words that it found. We did not run
it on the auto2 lattice for DWD-29.

Our current front-end analysis component
tends to be better at some kinds of phonetic
events than at others. This is a result of
the almost encyclopedic amount of
acoustic-phonetic and phonological knowledge
which is required to deal with the different
phenomena which can occur and the relatively
short amount of time which we have had to
embody this knowledge in computer algorithms.
This difference is illustrated by the
differences in performance between the two
utterances DWD-18 ("Have any people done
chemical analyses on this rock?") and DWD-29
("give me all lunar samples with
magnetite."). The former seems to contain
only phenomena with which the current
programs deal reasonably well, while the
latter contains such troublesome
configurations as the "all lunar" sequence.
In DWD-18, the performance of the auto2
acoustic analyzer is superior to that of the
manual analysis in terms of the precision and
accuracy measures, but its errors are
slightly different from those of the manual
analysis, and in particular, its resulting
transcription is such that the "people" word
match which was found on the manual analysis
was missed for auto1 and auto2. This is due
to the effect of a phonological rule which
the human apparently took into account in his
analysis but which the mechanical analysis
component did not know about. The
phonological rule component which has been
implemented since these experiments were run
is capable of recovering this match.


Performance  of  Syntax  and  Semantics 

For the higher level components of
Syntax and Semantics the same types of
precision and accuracy measurements no longer
seem appropriate until one has processed
large numbers of utterances and recorded the
success rate; and even then, there is no
natural notion of a precision measure.
Questions of interest in the syntactic and
semantic areas of the system include: how
much effort is devoted to searching blind
alleys before a correct interpretation of the
utterance is found?, how many false
interpretations are accepted in addition to
(or before) the correct one?, is the correct
one found at all?, etc.

While we do not begin to have answers to
these questions, we have run test cases which
can serve as benchmarks. We will illustrate
with a brief summary of the syntactic and
semantic processing of a sentence DWD24 ("How
many samples contain silicon?") from a
segment lattice obtained by mechanical
segmentation and labeling. (Two editing
changes were made to the lattice to manually
simulate the effects of phonological rules.)

[Figure 3: Example of performance of acoustic-phonetic processing and
lexical retrieval scan for "good" "big" words. Columns IDEAL, MANUAL,
AUTO1, AUTO2 for each of DWD-18 and DWD-29; number of segments in ideal
segmentation: 39 (DWD-18), 27 (DWD-29).]

In the initial lexical retrieval scan of
the segment lattice for this sentence, word
matches for "sample", "contain", and
"silicon" were found with acceptable acoustic
scores, together with a number of other
accidental word matches such as "contain" (in
another place in the input), "occur",
"occurring", "with", "content", "contents",
and many others. In the formation of
one-word theories, 4 different matches of
"contain" were combined into a single fuzzy
word match, 4 matches for "samples" and two
for "sample" were combined into another
single fuzzy match, and a number of other
fuzzy word matches and semantic "clumps"
occurred. Monitors placed by Semantics
during processing of one-word theories
detected coincidences between "samples" and
"occur(ing)", between "contain" and
"silicon", between "sample(s)" and "contain",
and others. These events were ordered by
their scores as assigned by the control
component and the first two-word theory
created was for "samples occur(ing)" (theory
#21). The second two-word theory was for
"sample(s) contain" (theory #22) and the
third for "contain silicon" (theory #23).
There was also a theory for "sample(s)" and
the other word match for "contain" (theory
#25). Theory #22 ("sample(s) contain")
detected the match for "silicon" and produced
theory #26 ("sample(s) contain silicon").
Also theory #23 ("contain silicon") detected
the word match for "sample(s)", but it
refrained from creating a duplicate of theory
#26 after detecting its presence. Theory #26
was then passed to Syntax for verification
and further prediction.

The word matches for theory #26 form a
contiguous sequence of words from position 6
in the signal (60 ms from the beginning of
the utterance) to the end, and Syntax was
able to parse this sequence without knowing
the word matches which occurred at the
beginning of the sentence. After parsing the
words that it was given, Syntax noticed word
matches already in the word lattice for
"many" and "any" ending at position 6,
proposed "much" and "there" and syntactic
classes DET (determiner) and PREP
(preposition), all ending at position 6. It
also set monitors at position 6 looking for
the classes ADJ, ORD, DET, N, V, NEG, and
PREP.

The notice for "any" from Syntax for
theory #26 resulted in a new theory for "any
samples contain silicon" (theory #30), which
detected the word "give" to its left.
However, Syntax rejected "give any samples
contain silicon" as being ungrammatical. The
notice for "many" combined with theory #26 to
give theory #31 ("many samples contain
silicon"), which in turn noticed several
words ending at the left end of "many"
including the word "how". The scores of the
words and the strategies applied by Control
are such that the 3Hth theory formed was the
complete analysis "how many samples contain
silicon".

In the process of this computation,
Semantics had placed 48 monitors of various
types on specific words and concepts in the
semantic network. There were 48 events
(resulting from notices from monitors) left
unprocessed on the event queue and an unknown
number of potential events which could have
been noticed if processing were continued.
Syntax had created 1U4 configurations and 144
transitions in its internal syntax tables,
and set 51 monitors on positions in the word
lattice.

Notice that although the potential
search space is vast, and the control
mechanism is set up to systematically cover
the entire space (if necessary) looking for
an interpretation of the utterance, the order
of processing theories is such that we have
found the correct analysis at a very early
stage of the search, leaving the vast
majority of the computations on other paths
undone.


Future  Developments 

As a consequence of further experience
with the gradually evolving SPEECHLIS and
further thought on the matter, it is clear
that we could benefit greatly from a
component presumably not used by Klatt and
Stevens in their experiment. This is a
prosodic component which knows the required
relationships between syntactic structure and
meaning, on the one hand, and the intonation
contour and stress patterns of a speech
utterance, on the other. When one considers
the inherent ambiguity of the speech
utterance which is entailed by the loss of
word and phoneme boundaries and the relative
uncertainty of identification of the
elementary units of phonetic "spelling", and
when one contrasts this with the fact that
sentences read aloud are capable of resolving
syntactic ambiguities which are not
resolvable in written form, it is clear that
some additional information must be present
in the spoken utterance beyond a mere
sequence of vaguely blurred sounds. It
appears that this additional information is
provided in the subtle variations in pitch,
energy, and segment duration which are
present in the spoken utterance and which
seemingly relate the speech signal directly
to the syntactic structure of the utterance.
Although not presently a part of SPEECHLIS,
we plan to include such a component in the
system in the near future. It is anticipated
that such information will greatly reduce the
number of possible syntactic analysis paths
which must be considered in the current
system.

Another development planned for the
future, and on which we are now working, is a
much more sophisticated word verification
component. This component will take a word
match proposed by lexical retrieval or other
sources, which has passed the tests of the
current word matching component, and will
perform a type of analysis-by-synthesis
derivation of the detailed behaviour of
formants, transitions, etc. This will then
be compared against the acoustic analysis
parameters of the speech signal to obtain a
more reliable word match score than that
currently obtained. We expect this component
to greatly reduce the number of accidental
word matches accepted for consideration by
the higher level components.

Conclusions 

We have presented a brief overview of
the various components of the BBN speech
understanding system together with a
motivation for the structure of the system,
the required capabilities of the individual
components, and a brief description of how
they work. More detailed descriptions of the
individual components are contained in
separate papers [1,6,7,8,9]. The components
of the current system are but crude
approximations of the components which we
plan to evolve, but they have been assembled
into a total system in their current state in
order to study their interactions. We
believe that the development of the
individual components will be more effective
and the results more realistic if their
development is done in the context of a total
system rather than in isolation, and our
experience so far bears this out. The
project is now in a state where the
interaction between the people working on
acoustic analysis and those working on
lexical retrieval and word matching as they
try to make their components fit together has
resulted in improvements to both sides, and
this appears to be a continuing process.

A central issue of the BBN speech
project is to gain insight into the ways in
which the higher level linguistic components
interact with the acoustic-phonetic and
phonological components in the overall speech
understanding process and to develop
techniques for making this happen efficiently
in mechanical speech understanding systems.
We are especially concerned with discovering
techniques which will be capable of dealing
with a large vocabulary, a fluent English
syntax, and a diversified range of semantic
concepts, rather than attempting to optimize
performance for small vocabularies and
restricted syntax and semantics. We are
concerned with finding the limits where
increased vocabulary size, increased fluency
of language, and increased range of semantic
diversity cannot be handled by increased
reliability in acoustic-phonetic and
phonological analysis and word verification.
Although the current capabilities of our
system are but suggestive promises of what is
to come, we think that the behaviour of this
minimal system on test sentences amply
illustrates the potential power of the
techniques which we have described. The full
assessment of their capabilities must however
await further development and testing.

Abstract


Errors in acoustic-phonetic recognition
occur not only because of the limited scope
of the recognition algorithm, but also
because certain ambiguities are inherent in
analyzing the speech signal. Examples of
such ambiguities in segmentation and labeling
(feature extraction) are given. In order to
allow for these phenomena and to deal
effectively with acoustic recognition errors,
we have devised a lattice representation of
the segmentation which allows for multiple
choices that can be sorted out by higher
level processes. A description of the
current acoustic-phonetic recognition program
in the BBN Speech Understanding System is
given, along with a specification of the
parameters used in the recognition.

Introduction 

One approach to automatic speech
recognition begins the recognition process by
attempting to divide the utterance into
segments which are hypothesized to be single
phonemes. The identity of each segment is
then partially or completely determined by
feature extraction or LABELING. Since
segmentation and labeling are interdependent,
the above process must be iterated to obtain
reasonably accurate recognition. In this
approach, segmentation errors such as missing
and extra segments will arise not only
because of the limited nature of an automatic
algorithm, but also because of the inherent
ambiguity of the acoustic signal. In
general, it is not possible to identify
segment boundaries with absolute certainty,
nor is one sure of the exact phoneme that the
segment represents [1,2,4]. Klatt and
Stevens [3] have illustrated the types of
acoustic variation that a single word can
undergo depending on the context. Such
variations can lead to segmentation and
labeling errors if the only source of
knowledge available is the acoustic signal.
In this paper we shall illustrate the types
of ambiguities that exist in analyzing a
speech signal, and then outline the method we
have adopted to deal with this problem in the
BBN Speech Understanding System (SPEECHLIS)
[9]. In addition, we give a brief
description of our current acoustic-phonetic
recognition program (APR).

Ambiguities in the Speech Signal

Below are a few examples that illustrate
the types of ambiguities that are found in
the speech signal.

a) A short dip in energy can be interpreted
in several ways. For example, fricatives
often have a short dip in energy at the
start and end of frication. Also, a short
nasal is often marked by a short drop in
energy. Therefore, a dip in energy
between a vowel-like sound and a fricative
could be just a segment boundary, or a
short nasal as in the word "answer".

b) A silent segment followed by a noisy
segment can be either a plosive followed
by a fricative, or the whole sequence can
be an aspirated plosive.

c) Certain formant transitions can be
interpreted as merely transitional, or as
distinct phonetic segments. Broad [1]
gives an example where the schwa in the
word "away" in "we were away" looks just
like a typical formant transition.

d) Unstressed tense vowels often tend to look
like their stressed but lax counterparts.
Thus, the formants of the [i] in "pretty
good" can look like a stressed [I].

Signal ambiguities, such as the examples
given above, can lead to segmentation and
labeling errors. Such errors occur also as a
result of normal but unpredictable local
variations in the signal, which frequently
degrade the performance of recognition
programs. There are, of course, also the
usual errors due to insufficient knowledge.
All these errors combine to make recognition
based on acoustics alone very difficult.

Segmentation errors appear in the form
of missing or extra segments. Labeling
errors cause the wrong phoneme to be
identified with a particular segment. Both
types of errors can make it difficult for the
correct word to match [8]. In our system, a
small change in the quality of the APR makes
a large change in the performance of the
entire system. If an APR is required to come
to a single decision at every point (i.e.
produce a linear string of single phoneme
segments) then segmentation and labeling
errors could often be fatal. Such errors
might be tolerated by the rest of the system
if there is a small vocabulary and/or a
limited syntax, from which to draw
constraints. But if these constraints are
not stringent enough, and a single
segmentation is desired, then the APR must
perform extraordinarily well to yield good
overall recognition. It is clear that in
general such accuracy in acoustic recognition
is unlikely. One must be able to generate
alternate choices so that the probability of
correct recognition is increased. This is
discussed below.

Vagueness in Recognition

The solution that we have adopted to
deal with ambiguities in the signal and with
segmentation and labeling errors is to
introduce a certain amount of vagueness into
the recognition process.

Vagueness in labeling is accomplished by
allowing more than one phoneme to represent a
segment. This increases the chances of
having the correct phoneme appear in a
segment label. However, this also means that
the number of possible word matches [8] in
each part of an utterance will also increase.

Vagueness in segmentation is implemented
by allowing more than a single segmentation
of any region of the given utterance.
Instead of having only a sequence of adjacent
segments, we now have the possibility of
overlapping segments. The resulting
segmentation forms what we call a SEGMENT
LATTICE (to be described under Segmentation
and Labeling; see also [8]). Again, this
vagueness in segmentation increases the
likelihood of finding the right words.
However, many other words are found in
addition.

It is desirable to have the correct
words which are provided by the solutions
described above, but the problems of dealing
with a large number of extra words can be a
very heavy burden on the system. Not only
will there be an increase in computation but
the problem of evaluating the different
combinations of words can become very
difficult. Therefore, one must be able to
adjust vagueness thresholds to keep a
workable balance between vagueness and
correctness of segmentation and labeling.

One solution is to include with each
segment, and with each phoneme in a segment
label, a confidence measure of that being the
correct path (sequence of segments) or
phoneme. Most APR's use some sort of scoring
algorithm to choose a path or a label. If
the scores correlate well enough with reality
to be used as a basis for a decision, they
are also valuable as a mechanism for
dynamically varying the number of choices
during lexical retrieval [8]. In other
words, by setting thresholds to be used with
the scores, this system can simulate
vagueness in a variable way. The question of
how many paths through an utterance to allow
is an efficiency matter. One would clearly
not want to keep around information about all
the possible paths. However, as long as the
scores assigned to the paths are meaningful,
keeping more paths around does not increase
vagueness. It merely makes the system more
flexible.
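
The idea of varying vagueness through a threshold on confidence scores
can be sketched as below. This is an illustration only: the 0-1 score
scale, the rule of always keeping the best-scoring phoneme, and the
function name are assumptions, not details given in this paper.

```python
def prune_label(scored_phonemes, threshold):
    """Keep the phonemes of a segment label whose confidence meets the threshold.

    scored_phonemes: dict mapping phoneme -> confidence (hypothetical 0-1 scale).
    The best-scoring phoneme is always kept so that no label becomes empty.
    """
    kept = {p for p, score in scored_phonemes.items() if score >= threshold}
    if not kept:
        kept = {max(scored_phonemes, key=scored_phonemes.get)}
    return kept


label = {"i": 0.9, "I": 0.7, "e": 0.3}
vague = prune_label(label, 0.2)     # all three choices survive
moderate = prune_label(label, 0.5)  # the weakest choice is dropped
strict = prune_label(label, 0.95)   # only the single best choice remains
```

The stored label never changes; only the threshold applied at retrieval
time does, which is what lets the same scores serve several levels of
vagueness.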

Acoustic Phonetic Recognition in SPEECHLIS

The APR component in the current BBN
Speech Understanding System consists of two
basic sections: parameter extraction, and
segmentation and labeling. The parameter
extraction component operates on the speech
signal at regular intervals and produces a
set of parameters. These parameters are then
used by the segmentation and labeling
component to perform the actual feature
extraction or recognition. The segmenter
locates possible phoneme boundaries and
constructs a lattice of optional segmentation
paths. Each boundary has associated with it
a confidence that it corresponds to an actual
boundary. The labeler then describes each
segment in the lattice in terms of acoustic
features or phoneme classes, which are
reduced to a small set of possible phonemes.
Also associated with each segment is a
measure of confidence that the correct
description was found.

Parameter Extraction

The analog speech signal is sampled at
20 kHz into 12 bit samples and then
normalized to 9 bits. All further processing
is done on the sampled data. Preemphasis by
simple differencing is employed only to
obtain an energy measure (R0D) and a
derivative of the preemphasized spectrum
(SDE).

Parameters are computed at the rate of
100 frames per second. For each frame, an
FFT is computed on 20 msec of the signal
(Hamming windowed). The spectral region from
5-10 kHz is used only once to obtain a
measure of the energy in that region (R0H).
All other parameters are obtained by applying
a 14 pole SELECTIVE LINEAR PREDICTION [5] to
the 0-5 kHz region of the spectrum. The
following table describes the basic set of
parameters used. (For details on parameters
related to linear predictive analysis, see
references [5,6,7].)

NAME   DEFINITION OR DESCRIPTION

R0     Energy in the 0-5 kHz region

R1     Normalized 1st autocorrelation coefficient.
       Also equal to the average of a cosine weighted
       spectrum

R0D    Energy of the differenced signal = 2*R0(1-R1)

V      Normalized LP (linear prediction) error. Also
       equal to the ratio of the geometric mean of the
       LP spectrum to its arithmetic mean

VP     -10 log V

TPF    Frequency of the complex pole-pair, using
       linear prediction with 2 instead of 14 poles

R0H    Energy in the 5-10 kHz region

SD     Average absolute value of the change in the LP
       spectrum between two consecutive frames (in dB)

SDE    Average absolute value of the change in the
       preemphasized LP spectrum (in linear units)

F0     Fundamental frequency

Figure 1: Basic Parameters
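
The relation between R0D, R0 and R1 given in the table can be checked
numerically with a short sketch. It is computed here in the time domain
on a made-up frame, whereas the system derives its parameters from the
spectrum. The identity is exact only if the frame is treated as zero
outside its ends, which is an assumption of this sketch; the function
names and sample values are hypothetical.

```python
def autocorrelation(x, lag):
    """Unnormalized autocorrelation of one frame at the given lag."""
    return sum(x[n] * x[n - lag] for n in range(lag, len(x)))


def differenced_energy(x):
    """Energy of the first difference, with the frame zero-padded at both ends."""
    padded = [0.0] + list(x) + [0.0]
    return sum((padded[n] - padded[n - 1]) ** 2 for n in range(1, len(padded)))


frame = [0.0, 0.5, 1.0, 0.5, -0.25, -1.0, -0.5, 0.0]  # made-up samples
r0 = autocorrelation(frame, 0)        # energy of the frame
r1 = autocorrelation(frame, 1) / r0   # normalized first coefficient
r0d = differenced_energy(frame)       # energy after simple differencing
```

Expanding the squared differences gives twice the energy minus twice the
lag-1 autocorrelation, which is 2*R0*(1-R1) once R1 is normalized by R0.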

There is a set of corresponding
parameters which reflect the change in the
values of the parameters over a single frame
(10 msec). These parameters have the same
name prefixed by a "D". Another set of
parameters reflect the change in the
parameters over 30-50 msec. These parameters
have the suffix "S" (for "slow"). For
example, along with the parameter R0 we also
have the "difference" parameters DR0 and
DR0S. In addition, the formants are
determined from the poles of the LP model.
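
A minimal sketch of the two kinds of difference parameters follows. The
text does not say exactly how the change is computed, so a plain
difference between frames is assumed, with a lag of 4 frames (40 msec)
standing in for the 30-50 msec "slow" interval. The function name and
the R0 values are hypothetical.

```python
def difference(values, lag):
    """Change in a parameter track over `lag` frames (0.0 where no history exists)."""
    return [
        values[i] - values[i - lag] if i >= lag else 0.0
        for i in range(len(values))
    ]


r0 = [1.0, 1.0, 2.0, 4.0, 8.0, 8.0, 8.0]  # made-up energy track, one value per 10 msec
dr0 = difference(r0, 1)    # change over a single frame (10 msec)
dr0s = difference(r0, 4)   # "slow" change, here taken over 40 msec
```

The fast track peaks at the steepest part of the rise, while the slow
track stays high across the whole transition.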

Segmentation and Labeling

The present segmentation and labeling
component can be broken into several major
phases. These phases are logically separate
but sequential (ordered). In the present
implementation, however, they are executed in
parallel, with appropriate lags separating
them so that the analysis of one phase can
effectively use any results of the previous
phases.

Segmentation. A piecewise linear
approximation to the formants is used to
indicate possible "formant boundaries". In
the first phase of segmentation, for each
frame the absolute value of each difference
parameter is compared with a threshold
related to the specific parameter. If the
threshold is exceeded, a score corresponding
to this parameter is added to a total score
for the likelihood that there is a boundary
at that frame. Parameters considered in this
phase are: DVP, DR0, SD, DVPS, DR0D, SDE,
FMBDR, DR0S, and DR0DS, in decreasing order
of importance.
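
The first phase can be sketched as a thresholded vote per frame. The
threshold and score values below are hypothetical placeholders, since
the paper does not list them, and only three of the nine parameters are
shown.

```python
# Hypothetical thresholds and scores; the values used in the system are not given.
THRESHOLDS = {"DVP": 3.0, "DR0": 2.0, "SD": 1.5}
SCORES = {"DVP": 4, "DR0": 3, "SD": 2}  # larger score = more important parameter


def boundary_score(frame):
    """Sum the scores of all difference parameters whose magnitude exceeds its threshold."""
    return sum(
        SCORES[name]
        for name, limit in THRESHOLDS.items()
        if abs(frame.get(name, 0.0)) > limit
    )


frames = [
    {"DVP": 0.5, "DR0": 0.1, "SD": 0.2},   # steady state: nothing fires
    {"DVP": -3.5, "DR0": 2.5, "SD": 1.0},  # DVP and DR0 exceed their thresholds
    {"DVP": 1.0, "DR0": -2.2, "SD": 1.6},  # DR0 and SD exceed their thresholds
]
frame_scores = [boundary_score(f) for f in frames]
```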

The values of the thresholds are such
that most frames will end up with a score of
zero. However, when there is a boundary,
there is usually more than one frame with a
non-zero score. In the second phase of
segmentation, adjacent non-zero frames within
40 msec are "merged" into one boundary, if
there is no evidence of a short nasal stop at
that point.
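
The merging step might look like the sketch below. Several details are
assumptions made for illustration: the 40 msec span is measured from
the first frame of a group, the merged boundary is placed at the
highest-scoring frame, its score is the sum over the group, and the
check for a short nasal is left out.

```python
FRAME_MS = 10  # one frame every 10 msec (100 frames per second)


def merge_boundaries(scores, max_span_ms=40):
    """Group non-zero frames lying within max_span_ms and return one boundary per group.

    Returns (frame_index, total_score) pairs; placement and scoring are assumptions.
    """
    groups = []
    for i, score in enumerate(scores):
        if score == 0:
            continue
        if groups and (i - groups[-1][0]) * FRAME_MS <= max_span_ms:
            groups[-1].append(i)
        else:
            groups.append([i])
    return [
        (max(group, key=lambda i: scores[i]), sum(scores[i] for i in group))
        for group in groups
    ]


frame_scores = [0, 0, 3, 5, 0, 2, 0, 0, 0, 0, 0, 4, 0]
merged = merge_boundaries(frame_scores)
```

Frames 2, 3 and 5 fall within 40 msec of one another and collapse into
one boundary at frame 3, while frame 11 stands alone.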

In  the  third  phase  of  segmentation,  a 
piecewise  linear  fit  to  the  parameter  R0D  is 
used  to  find  new  boundaries.  If  one  of  these 
new  boundaries  is  close  to  a  merged  boundary, 
then  the  time  of  the  boundary  is  changed  to 
that  of  the  new  one.  If  there  is  no  nearby 
boundary,  then  a  new  boundary  is  created. 

Since the above procedures tend to find
many extra boundaries, those with lower
scores are considered optional. At this
point, a LATTICE of segments is formed to
express the optionality.

The lattice structure makes it possible
to express different paths (sequences of
segments) describing the period between two
points in the utterance. In the lattice
structure shown below, the horizontal axis
represents time, and the vertical lines
represent segment boundaries. The numbers
are used to identify unique segments. There
are 3 ways to describe the period from A to
B: (1-2; 3-4-2; 5-6-7), two ways to describe
period B - C: (8; 10-11), and two ways to
describe period C - D: (9; 12-13-14). In
all, there are 3x2x2=12 ways to describe the
period from A to D.
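
The path count can be reproduced by treating the lattice as a graph
whose nodes are boundaries and whose edges are the numbered segments.
In the sketch below the segment numbers and the boundaries A-D come
from the text; the names of the intermediate boundaries (p, q, r, s, t,
u, v) are invented for illustration.

```python
# Segment number -> (left boundary, right boundary).
SEGMENTS = {
    1: ("A", "p"), 2: ("p", "B"),
    3: ("A", "q"), 4: ("q", "p"),
    5: ("A", "r"), 6: ("r", "s"), 7: ("s", "B"),
    8: ("B", "C"),
    10: ("B", "t"), 11: ("t", "C"),
    9: ("C", "D"),
    12: ("C", "u"), 13: ("u", "v"), 14: ("v", "D"),
}


def paths(start, end):
    """List every sequence of segment numbers leading from start to end."""
    if start == end:
        return [[]]
    found = []
    for number, (left, right) in sorted(SEGMENTS.items()):
        if left == start:
            found.extend([number] + rest for rest in paths(right, end))
    return found


a_to_b = paths("A", "B")
a_to_d = paths("A", "D")
```

Because every path from A to D passes through B and C, the total is the
product of the counts for the three periods.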


Figure 2: Example Segment Lattice


Labeling. The labeling procedure for
each segment consists of comparing average
values of parameters (over the central half
of the segment) to thresholds for several
features (see table below). The averages of
adjacent segments and the change in each
parameter over the segment are also
considered. The table below shows how a high
or increasing value of each parameter
correlates with the different features.
Opposing features are separated by slashes,
so that the presence of the first implies the
absence of the second. For example, a high
total energy (R0) indicates a sonorant and a
nonobstruent at the same time.


PARAM        DESCRIPTION             FEATURES AFFECTED

R0           Total Energy            Sonorant/Obstruent,
                                     Vowel/Nasal, Voiced/Unvoiced,
                                     Fricative/Plosive

R0D          Energy of Differenced   (Same kind of evidence as R0)
             Signal

R0H          Energy between          Obstruent/Sonorant,
             5-10 kHz                Fricative/Plosive,
                                     Vowel/Nasal

VP           Normalized Error        Sonorant, Nasal, Voiced

TPF          Frequency of 2-pole     Fricative, Vowel/Nasal.
             LP model                Reflects tongue height of
                                     vowels between 200-800 Hz.

R1           1st Autocorrelation     Indicates lack of high frequency
             Coefficient             energy, not a Fricative

F0           Fundamental Frequency   Its presence indicates voicing

F1, F2, F3   First Three Formants    Give information about the
                                     place of articulation of
                                     vowels and glides.

Figure 3: Labeling Parameters

Associated  with  each  segment  description 
is  a  segment  confidence,  which  is  a  score 
that  reflects  the  confidence  that  the  correct 
phoneme  is  included  in  the  label.  It  is 
related  to  the  scores  of  its  constituent 
features,  which  depend  on  the  deviation  of 
each  of  the  pieces  of  evidence  (mostly 
parameter  averages)  from  their  neutral 
points.  If  one  of  the  feature  decisions  is 
close  to  its  neutral  point,  no  decision  can 
be made reliably, so both options are kept.
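
The neutral-point behaviour described above can be illustrated with a small sketch. It is only an assumption about how such a decision might be coded: the function name `decide_feature`, the `margin` parameter, and all numeric values are hypothetical, and the way feature scores combine into a segment confidence is not specified in the text, so it is left out.

```python
def decide_feature(value, neutral, margin, high_label, low_label):
    """Return (labels kept, score) for one feature decision.

    `margin` is a hypothetical width of the uncertain band around the
    neutral point; the values used by the real system are not given.
    """
    deviation = value - neutral
    if abs(deviation) < margin:
        # Too close to the neutral point: keep both options.
        return {high_label, low_label}, 0.0
    label = high_label if deviation > 0 else low_label
    # The score grows with the distance from the neutral point.
    return {label}, abs(deviation)


# Hypothetical R0 averages compared against a neutral point of 50.
both, score_both = decide_feature(52.0, 50.0, 5.0, "sonorant", "obstruent")
one, score_one = decide_feature(62.0, 50.0, 5.0, "sonorant", "obstruent")
```

In the first call the average lies inside the uncertain band, so the label keeps both alternatives; in the second the decision is made and its score reflects the deviation.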

An attempt is made to fit cubic
polynomials to the formants of segments with
high energy. Target formants determined from
these cubics are compared against model
targets for the 15 vowels and glides in our
system. Included is a frequency
normalization based on the fundamental
frequency. The matching procedure takes into
account the individual values of the formants
as well as the values of the formants
relative to each other. The resulting match
scores are used (along with duration for
glides and diphthongs) to select up to four
phonemes for the segment label.

For  those  segments  labeled  as  strident 
fricatives,  the  place  of  articulation  is 
determined  by  a  threshold  on  the  two-pole 
frequency  (TPF)  computed  at  a  point  two 
thirds  of  the  way  into  the  segment. 

BBN Report No. 2856          Bolt Beranek and Newman Inc


R0D Dip Detector. After the basic
segmenting and labeling is finished, a dip
detector is applied to the parameter R0D to
find additional boundaries. If these
boundaries do not correspond to the existing
boundaries, additional (optional) branches
are added to the lattice, and the new
segments are labeled in the normal manner.
The times of these new boundaries were found
to correspond very well with the hand-labeled
boundaries. Therefore, these new boundaries
will, in the future, be used to adjust the
time of the other boundaries.
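
The text does not say how the dip detector works, so the following is only a sketch of one plausible scheme, under stated assumptions: a frame counts as a dip if it is a local minimum lying at least `depth` below the highest value within `window` frames on each side, and a dip yields a new optional boundary only if no existing boundary lies within `tolerance` frames. All names and numbers are hypothetical.

```python
def find_dips(values, depth, window):
    """Return indices of local minima at least `depth` below the
    highest value within `window` frames on both sides."""
    dips = []
    for i in range(1, len(values) - 1):
        if not (values[i] <= values[i - 1] and values[i] < values[i + 1]):
            continue
        left = max(values[max(0, i - window):i])
        right = max(values[i + 1:i + 1 + window])
        if left - values[i] >= depth and right - values[i] >= depth:
            dips.append(i)
    return dips


def new_boundaries(dips, existing, tolerance):
    """Keep only dips that do not coincide with an existing boundary."""
    return [d for d in dips
            if all(abs(d - b) > tolerance for b in existing)]


# Hypothetical R0D track (one value per frame) with an existing boundary at 3.
r0d = [60, 62, 61, 50, 59, 63, 62, 48, 47, 60, 61]
dips = find_dips(r0d, depth=8, window=3)
added = new_boundaries(dips, existing=[3], tolerance=1)
```

Here the dip at frame 3 coincides with the existing boundary, while the dip at frame 8 would give rise to an additional optional branch.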

Special Cases. There are some checks
made which take into account certain
phonological phenomena. Certain segment
boundaries found toward the end of the
sentence are ignored because of the tendency
to stretch out the end of a sentence. A path
in the lattice described as unvoiced plosive
followed by unvoiced weak frication is
bridged by an optional single segment labeled
as unvoiced plosive. Long plosives are
optionally split into two plosives. Two
adjacent segments with identical labels are
bridged with one segment. These and other
similar rules take into account some of the
inherent ambiguity in the acoustic waveform.
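
As an illustration of the last of these rules, the sketch below adds a bridging segment over two adjacent segments with identical labels. The tuple representation and the name `bridge_identical` are assumptions made for this example; the existing segments are left in place, so the bridge is an optional extra branch.

```python
def bridge_identical(segments):
    """Return optional bridging segments for adjacent identical labels.

    Each segment is (start_boundary, end_boundary, frozenset_of_phonemes).
    Existing segments are left untouched; only new branches are returned.
    """
    bridges = []
    for s1, e1, label1 in segments:
        for s2, e2, label2 in segments:
            if e1 == s2 and label1 == label2:
                bridges.append((s1, e2, label1))
    return bridges


# Hypothetical three-segment path; the first two labels are identical.
example = [
    (0, 1, frozenset({"S", "Z"})),
    (1, 2, frozenset({"S", "Z"})),
    (2, 3, frozenset({"IY"})),
]
bridges = bridge_identical(example)
```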

Future  System 

At this time statistical studies of the
correlations between certain parameters and
features are being carried out. The scores
on segment boundaries or on phonemes within a
label will be determined by probabilities
based on these studies. In keeping with the
philosophy held here, each segment label will
consist of a score for each phoneme (36 in
our present system). Then, depending on the
application, the lexical retriever would use
all phonemes with a score above a certain
threshold to achieve the desired vagueness.

Abstract


Automatic speech understanding requires
the development of programs which can
formulate hypotheses about the content of an
utterance and attempt to verify them. One
example of such activity in the BBN Speech
Understanding System (SPEECHLIS) is the use
of information from a feature analysis of the
sampled speech signal to propose and evaluate
word matches which cover portions of the
input utterance. Words proposed by higher
level components are also verified against
the feature analysis. It is at this
interface between acoustic transcription and
word matches that knowledge about the
vocabulary, phonemic spellings, phoneme
similarity, and phonological rules is
represented and applied. The representation
and use of such knowledge in the SPEECHLIS
system is described.

Introduction 

A  central  problem  in  automatic  speech 
understanding  is  how  to  select  words  for 
consideration  as  possible  components  of  an 
utterance.  If  there  are  too  many  words  to 
consider  in  each  region  of  the  utterance, 
then  not  only  will  the  processing 
requirements  tend  to  explode,  but  also  the 
evaluation procedures can become intractable.
Therefore,  in  order  to  treat  the  problem  of 
speech  understanding  effectively,  one  must 
develop  experience  and  insight  into  how  to 
perform  word  selection  while  restricting 
possible combinatoric explosions.

The information available for word
selection includes the vocabulary and how its
words are pronounced, the syntactic,
semantic, and pragmatic constraints of the
task domain, the acoustic transcription
(which includes segmentation and labeling),
and knowledge about the ways in which the
pronunciation of words can vary (phonological
rules). For task domains which deal with a
small vocabulary and/or have strong syntactic
and semantic constraints, the number of words
which could appear in a given region of the
utterance can be limited substantially. For
certain such systems, possible words and
partial word sequences can be enumerated (in
a "top-down" manner) before considering the
acoustic transcription, and then ordered on
the basis of how well they match the acoustic
transcription. The BBN speech understanding
project [5] has chosen to develop a system for
tasks in which such constraints are not
strong enough to so limit the sets of
possible words in the early stages of the
understanding process. Instead, in a
"bottom-up" manner, information from the
acoustic transcription is used in an initial
phase of hypothesis formation to suggest
words which match well. These words are then
sent to syntax and semantics for
consideration.

Word selection occurs in SPEECHLIS at
the interface between acoustic-phonetic
programs which construct the acoustic
transcription [4] and syntactic, semantic,
pragmatic and control programs which combine
word matches into tentative hypotheses about
the meaning of the utterance [1,2,3]. The
programs that perform word selection have two
tasks: to use the acoustic transcription to
propose words which could have been spoken
(Lexical Proposal), and to evaluate how well
a proposed word matches the acoustic
information (Lexical Matching). The term
"lexical retrieval" will be used to represent
these two tasks. This paper describes the
way in which lexical retrieval fits into the
SPEECHLIS system, the strategies for Lexical
Proposal and Lexical Matching, and the
representation and use of phonological rules.

Lexical  Retrieval  in  SPEECHLIS 
Data  Structures 

The lexical retrieval programs have
access  to  data  structures  which  represent  the 
acoustic  transcription  of  the  utterance,  the 
vocabulary,  a  corpus  of  phonological  rules, 
and  a  "phoneme  similarity  matrix". 

The Acoustic Transcription. The
acoustic transcription is in the form of a
structured collection of SEGMENT descriptors.
By a segment we mean a portion of the
utterance which is hypothesized to be a
single phoneme. Each segment has a
description which could in principle specify
the phonemic identity of the segment, but in
general merely constrains this identity to
one of several phonemes. This set of
phonemes represents the acoustic features
that were detected in a feature analysis of
the segment. The number of phonemes in the
set reflects the level of detail in the
result of the feature analysis. This level
of detail is adjusted for each segment to
maintain a reasonable balance between
vagueness of feature description and
confidence that the feature description is
correct. For each segment and each boundary
between segments in the segment lattice, a
crude measure of this confidence is
represented. Alternative hypothesized
segments may overlap in the utterance,
resulting in a lattice of segment descriptors
rather than a single string. Figure 1 gives
an example of such a SEGMENT LATTICE. The
numbers along the top are used to identify
the boundaries between segments. Each
segment is labelled with its set of
alternative phonemes. This structure allows
for the representation of uncertainty or
ambiguity both in the determination of the
segment boundaries and in the identity of a
segment.
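
One way to picture such a lattice is as a list of segment records keyed by boundary numbers. The sketch below is an illustration only: the `Segment` fields, the `paths` helper, and the example labels and confidences are assumptions, not the SPEECHLIS data structures. It assumes every segment ends at a higher boundary number than it starts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int            # boundary number where the segment begins
    end: int              # boundary number where it ends
    phonemes: frozenset   # alternative phonemes allowed by the label
    confidence: float     # crude confidence in the label


def paths(segments, start, end):
    """Yield every sequence of segments leading from start to end."""
    if start == end:
        yield []
        return
    for seg in segments:
        if seg.start == start:
            for rest in paths(segments, seg.end, end):
                yield [seg] + rest


# Hypothetical lattice: boundaries 1-3 are covered either by two
# segments or by one overlapping alternative.
lattice = [
    Segment(0, 1, frozenset({"P", "T"}), 0.8),
    Segment(1, 2, frozenset({"R"}), 0.6),
    Segment(1, 3, frozenset({"ER"}), 0.5),
    Segment(2, 3, frozenset({"IH", "EH"}), 0.7),
]
all_paths = list(paths(lattice, 0, 3))
```

Because alternative segments may span the same region, enumerating `paths` yields two readings of the region from boundary 0 to boundary 3.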

The Vocabulary. The vocabulary is
represented as a set of words (currently
250), each having a set of its most likely
pronunciations as lists of phonemes and
syllable boundary markers. On the average,
there are about two pronunciations
represented for each word in the vocabulary.
Associated with each phoneme is an estimate
of the probability that it will be absent in
an actual pronunciation of the word. Each
vowel has an expected stress value (either
"primary stress", "secondary stress", or
"unstressed"). There also exists a
cross-referenced data structure for the
vocabulary which has for each phoneme a list
of words which either start or end with that
phoneme, and for each ordered pair of
phonemes a list of words in which that
phoneme pair occurs, with the associated
indices into the phonemic spellings.
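
The cross-referenced structure can be sketched as three indexes built from the phonemic spellings. This is a simplified illustration under assumptions: the name `build_indexes` is hypothetical, syllable boundary markers, deletion probabilities and stress values are omitted, and the two example spellings are written in an ARPAbet-like notation chosen for the example.

```python
from collections import defaultdict


def build_indexes(vocabulary):
    """vocabulary maps word -> list of phonemic spellings (tuples)."""
    starts = defaultdict(set)    # phoneme -> words starting with it
    ends = defaultdict(set)      # phoneme -> words ending with it
    pairs = defaultdict(list)    # (p1, p2) -> (word, spelling no., index)
    for word, spellings in vocabulary.items():
        for n, spelling in enumerate(spellings):
            starts[spelling[0]].add(word)
            ends[spelling[-1]].add(word)
            for i in range(len(spelling) - 1):
                pairs[(spelling[i], spelling[i + 1])].append((word, n, i))
    return starts, ends, pairs


# Hypothetical two-word vocabulary with one spelling per word.
vocabulary = {
    "mean": [("M", "IY", "N")],
    "print": [("P", "R", "IH", "N", "T")],
}
starts, ends, pairs = build_indexes(vocabulary)
```

The stored index lets a matcher align a phoneme pair found in the lattice with the right position inside each spelling.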

The Similarity Matrix. Information about
the similarity of phonemes is represented in a
SIMILARITY MATRIX. Each entry in this matrix
is an estimate of the likelihood for a pair
of phonemes (P1 P2) that a segment labelled P2
is really P1, i.e. how "similar" P2 is to P1.
The similarity matrix has two uses: to
adjust for the known performance of the
acoustic-phonetic programs, and to account
for variations in phoneme pronunciation
not yet implemented as [...]. In the present
system, these estimates are derived from [...].
As we gather statistics from real
instances of phoneme confusion, we will
adjust these estimates.

Phonological Knowledge. Phonological
knowledge tells us about the ways in which
the pronunciation of words can vary. One of
the tasks of the lexical retrieval programs
is to take account of such knowledge as these
programs look for word matches in the segment
lattice. In addition to the phonological
information in the phonemic dictionary and in
the similarity matrix, SPEECHLIS has a corpus
of context-dependent analytic phonological
rules. These are represented in a collection
of data structures which specify contexts in
the segment lattice in which phonemes can be
changed, inserted, or deleted. Because they
represent transformations from observed
phonetic sequences to sequences which conform
to the phonemic spellings in the dictionary,
these are termed analytic (as opposed to
generative) phonological rules. Each rule
has three components:

a) a template describing the necessary
context to be searched for in the segment
lattice.

b) a description of a new branch to be added
to the lattice, given the presence of the
necessary context. The attributes of this
new branch can depend on the attributes of
the context found in the lattice.

c) a predicate (see below).

The  segment  lattice  as  constructed  by  the 
acoustic-phonetic  programs  represents  initial 
(and currently, largely context-free)
hypotheses  as  to  the  existence  of  boundaries 
and  acoustic  features  of  segments  in  the 
utterance.  After  this  segment  lattice  is 
constructed,  a  rule-interpretation  program 
applies  the  set  of  rules  to  the  lattice.  The 
action  of  these  rules  is  never  to  change  the 
existing  lattice  structure,  but  rather  to  add 
new  branches  which  specify  optional  paths 
through  the  lattice.  In  general,  the 
admissibility  of  the  new  branch  cannot  be 
entirely  determined  from  the  information  in 
the  lattice  alone.  It  is  the  job  of  the 
predicate  to  complete  the  task  of  determining 
the  applicability  of  the  rule  when  a  portion 
of  a  particular  phonemic  spelling  is  being 
considered  by  the  lexical  matcher.  A 
predicate  is  an  arbitrary  Boolean  function  of 
three  arguments:  a  phonemic  spelling,  the 
phoneme  position  within  the  spelling  at  which 
the  new  branch  is  being  considered,  and  a 
pointer  to  the  new  branch  in  the  lattice.  A 
pointer  to  the  rule's  predicate  is  attached 
to  each  new  branch  when  the  branch  is  added 
to  the  lattice.  This  pointer  is  used  by  the 
lexical  matcher  to  access  and  apply  the 
predicate.  The  predicate  returns  true  if  it 
accepts the use of the branch in the word
match  or  false  if  it  rejects  it. 

Additional branches inserted by the
rules ensure that the lexical retrieval
programs will consider those standard word
spellings which could have the indicated
phonological variation. Such a scheme serves
to (implicitly) select for consideration
variations on the standard phonemic spelling
ONLY WHEN the standard spelling is not
represented in the segment lattice AND a
variation of it is possible on the basis of
the detection of an appropriate context (in
the segment lattice) for the application of
the phonological rule. Furthermore, the
pattern match processing necessary to detect
such contexts for determining the
applicability of each phonological rule is
done only once in a special scan over the
segment lattice: it is not necessary to
analyze the segment lattice anew for
applicable phonological patterns whenever a
standard phonemic spelling is being
considered by the lexical matcher.

An example of a phonological rule is the
Nasal Deletion Rule. In its generative form,
it is: "A nasal consonant can be deleted if
it occurs immediately after a vowel and
immediately before a nonnasal consonant with
the same place of articulation." This rule
says, for example, that in the word "sample"
the "m" may be deleted (and the preceding
vowel nasalized). It is implemented
analytically as: "If there exists a path
through the lattice such that a vowel segment
is followed by a nonnasal consonant (not [h]
or [r]), then bridge the existing vowel
segment by a two-segment branch consisting of
the vowel followed by a nasal. Attach a
predicate (described below) to the nasal
segment. (If such a branch already exists,
then no new branch need be added.)" The
predicate requires that the nasal not be word
initial, and it checks that the preceding
phoneme (of the phonemic spelling) is indeed
a vowel (and not a non-vowel which matched
via a similarity), and that the nasal and the
following consonant match in place of
articulation.
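
The predicate for this rule might look roughly like the following sketch. It mirrors the three-argument form described earlier (phonemic spelling, phoneme position, pointer to the new branch), but everything else is an assumption: the function name, the small phoneme tables, the spelling given for "sample", and the choice to reject a word-final nasal because no following consonant is available in the spelling.

```python
VOWELS = {"AE", "IH", "IY", "EH", "AH"}            # illustrative subset
NASAL_PLACE = {"M": "labial", "N": "alveolar", "NX": "velar"}
CONSONANT_PLACE = {"P": "labial", "B": "labial", "T": "alveolar",
                   "D": "alveolar", "K": "velar", "G": "velar"}


def nasal_deletion_predicate(spelling, position, branch):
    """Accept the inserted nasal branch for spelling[position]?

    `branch` is unused in this sketch; it is kept only to mirror the
    three-argument form described in the text.
    """
    if position == 0:
        return False                     # nasal must not be word initial
    if position + 1 >= len(spelling):
        return False                     # assumption: needs a following consonant
    nasal = spelling[position]
    before = spelling[position - 1]
    after = spelling[position + 1]
    if nasal not in NASAL_PLACE or before not in VOWELS:
        return False                     # preceding phoneme must be a vowel
    # Nasal and following consonant must share place of articulation.
    return CONSONANT_PLACE.get(after) == NASAL_PLACE[nasal]


sample = ("S", "AE", "M", "P", "AH", "L")   # hypothetical spelling
mean = ("M", "IY", "N")
accept_sample = nasal_deletion_predicate(sample, 2, None)
accept_mean = nasal_deletion_predicate(mean, 0, None)
```

The lexical matcher would call such a predicate only when it tries to use the inserted nasal branch for a particular spelling, which is why the spelling and position are arguments rather than part of the lattice.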

Output. The output of the lexical
retrieval programs is a set of WORD MATCHES.
Each word match is a correspondence between
one phonemic spelling of a word and a path
through the segment lattice. A score is
associated with each word match to indicate
how well the phonemic spelling matches the
sequence of segment descriptors. Word
matches to be examined by syntax, semantics,
and pragmatics are entered into a WORD
LATTICE (such a lattice is illustrated in
Figure 2). In this figure, for example, the
word "mean", spelled (M IY N), matches from 2
to 5 in the lattice, while the word "print",
spelled (P R IH N T), matches from 0 to 5.
The first of the two numbers in parentheses
for each word represents the score of the
word match. The second number represents the
maximum possible score for that word on the
basis of the length (number of phonemes) of
its phonemic spelling.


The overall control strategy for
SPEECHLIS, starting from an acoustic
transcription which has been expanded by the
analytic phonological rules, is first to
perform a scan over the entire segment
lattice to find word matches anywhere in the
utterance which are longer than two phonemes
and which match well. These are used to
construct an initial word lattice. An
attempt to find acceptable word matches at
the beginning of the utterance from a set of
likely sentence-initial words then occurs.
Any such word matches are added to the word
lattice. The system then enters a phase of
tentative hypothesis formation, in which word
matches from the word lattice are combined
into word match aggregates (called THEORIES)
on the basis of semantic, syntactic, or
pragmatic support. As the system then
attempts to verify, enlarge, and combine
these theories, the lexical retrieval
programs are called upon to evaluate the
matches of words which are proposed by
syntax, semantics, and pragmatics. Examples
of such proposals are: content words which
are likely to be adjacent to a word being
considered, function words which are likely
to follow a sequence of words, and possible
inflectional endings and auxiliary verbs for
a given word.


An extensive set of parameters is
available for controlling the activity of the
lexical retrieval programs. These parameters
allow the specification of constraints on the
length of acceptable words, word match
quality acceptance thresholds, and
requirements that word matches begin or end
at a specified boundary or in a specified
region of the segment lattice. In addition,
there are parameters for selecting among
several strategies for searching and
matching, including the consideration of word
matches with missing or extra segments.
These strategies are described below.


Strategies 


Lexical  Proposal 

There are two ways in which words can be
selected for consideration from the
information in a specified region of the
segment lattice. One way is to consider, for
each phoneme of each segment in the region,
the set of word spellings which begin or end
with that phoneme. This is called an
"anchored" scan. The other method is the
"unanchored" scan, in which a word spelling
is proposed if it has a specified pair of
adjacent phonemes anywhere in its spelling.
A set of such phoneme pairs is computed for
each pair of adjacent segments in the given
region of the segment lattice. This set is
the cross product of the phoneme sets
representing the two segments. The
unanchored method is currently being used in
SPEECHLIS for the complete initial scan.
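
The unanchored scan can be sketched as a cross product followed by an index lookup. This is a minimal illustration, not the SPEECHLIS code: the function name, the shape of `pair_index` (phoneme pair to a list of word and position entries), and the example labels are all assumptions.

```python
from itertools import product


def unanchored_proposals(left_phonemes, right_phonemes, pair_index):
    """Propose (word, position) entries for two adjacent segments."""
    proposals = []
    # Cross product of the two segment labels, in a fixed order.
    for pair in product(sorted(left_phonemes), sorted(right_phonemes)):
        proposals.extend(pair_index.get(pair, []))
    return proposals


# Hypothetical pair index: phoneme pair -> (word, index of first phoneme).
pair_index = {
    ("M", "IY"): [("mean", 0)],
    ("IY", "N"): [("mean", 1)],
    ("IH", "N"): [("print", 2)],
}
proposals = unanchored_proposals({"IY", "IH"}, {"N", "M"}, pair_index)
```

Each proposal carries the position of the pair inside the spelling, which is what allows the matcher to start in the middle of a word.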

Lexical  Matching 

The lexical matching algorithm is a
"recursive tree walk". For a given boundary
in the segment lattice, a given phonemic
spelling, and a given index to one of the
phonemes in the phonemic spelling, this
algorithm walks the segment lattice
postulating phoneme-segment matches. The
index into the phonemic spelling is "aligned"
with the given boundary in the lattice. If
the given index divides the phonemic spelling
into two parts, as is usually the case during
an unanchored scan, then a "middle-out" walk
is performed. Otherwise, either a
"left-to-right" or a "right-to-left" walk is
done, depending on whether the index points
to the first phoneme (left end) of the
phonemic spelling or to the last phoneme
(right end). For possible missing or extra
segments and branch points in the segment
lattice, the matcher is called recursively to
consider the alternate paths through the
segment lattice. Each postulated
phoneme-segment match is evaluated on the
basis of the similarity between the given
phoneme and the most similar phoneme in the
segment label. The phoneme-segment match
score is quantized as a number between zero
and 5; a higher score represents a better
match. Each phoneme-segment evaluation is
used to adjust a cumulative overall word
match score. This score is initialized to
the maximum possible score for the word, and
is incrementally adjusted as phoneme-segment
match scores are considered. This maximum
score depends on the length of the phonemic
spelling. For each vowel in the phonemic
spelling, a simple analysis of the segment
duration is used to adjust this word match
score. This is done on the basis of whether
the vowel is tense or lax, and whether it is
stressed or unstressed in the word spelling.
For example, the appearance of an unstressed,
lax vowel in a segment having a duration
greater than 100 milliseconds is assumed very
unlikely. Any word match in which such a
phoneme-segment match is a component will
have its score decreased substantially. If a
missing or extra segment is postulated, its
score is computed from a priori information
(in the dictionary) about the likelihood of
such a phenomenon for the indicated portion
of the phonemic spelling. If the word match
score falls below a specified word match
score acceptance threshold, consideration of
this path through the segment lattice is
terminated. Note that, because of branching
in the segment lattice, it is possible for a
phonemic spelling to match along more than
one path through the same region of the
segment lattice. Of these matches only the
ones with the best scores are entered into
the word lattice.
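
A stripped-down version of the left-to-right walk is sketched below. It rests on several assumptions that go beyond the text: the maximum score is taken to be 5 per phoneme, each phoneme-segment match lowers the running score by its shortfall from 5, and missing or extra segments, duration analysis, and middle-out walks are left out. The names `match_word` and `toy_similarity` and the toy lattice are hypothetical.

```python
NASALS = {"M", "N"}


def toy_similarity(p, q):
    """Hypothetical quantized similarity (0..5) that label q is really p."""
    if p == q:
        return 5
    if p in NASALS and q in NASALS:
        return 3
    return 0


def match_word(spelling, segments, boundary, similarity, threshold):
    """Left-to-right walk; returns (score, end_boundary) for each match.

    segments: list of (start, end, phoneme_set) lattice branches.
    """
    results = []

    def walk(index, at, score):
        if score < threshold:
            return                       # abandon this path
        if index == len(spelling):
            results.append((score, at))
            return
        for start, end, label in segments:
            if start == at:              # recurse over each branch point
                best = max(similarity(spelling[index], q) for q in label)
                walk(index + 1, end, score - (5 - best))

    walk(0, boundary, 5 * len(spelling))
    return results


toy_lattice = [
    (2, 3, {"M", "N"}),
    (3, 4, {"IY", "IH"}),
    (4, 5, {"N"}),
    (3, 5, {"IY"}),
]
good = match_word(("M", "IY", "N"), toy_lattice, 2, toy_similarity, 11)
bad = match_word(("P", "IY", "N"), toy_lattice, 2, toy_similarity, 11)
```

With this lattice the spelling (M IY N) matches along one path from boundary 2 to boundary 5 at the full score, while (P IY N) is abandoned after its first phoneme because the running score drops below the acceptance threshold.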

Performance  and  Future  Work 

Since the first version of SPEECHLIS has
only recently been assembled, we are not yet
able to present a thorough analysis of the
lexical retrieval performance requirements
for acceptable overall system performance.
From the few utterances that we have tried
using this system, however, we have formed
some tentative impressions:

1. For a normal-sized utterance (e.g. 9
words; 5 content words), the system will
probably perform well with an initial word
lattice having roughly 100 word matches, if
all or all but one of the content words are
present with good scores.

2. The quality of overall system performance
depends greatly on the quality of lexical
retrieval performance. The payoff of
improvements in lexical retrieval performance
will be high.

Work underway to improve lexical
retrieval performance is directed toward
increasing the number and quality of correct
word matches found, especially from the
initial scan, while keeping both the number
of incorrect word matches and the processing
requirements within manageable limits. In
addition to a continuing effort to improve
the programs that perform segmentation and
labeling, a program is being developed in
which speech synthesis techniques will be
used to construct a general representation of
the expected acoustic parameters for a given
phonemic spelling. These will then be
matched against the parameters which were
extracted from the real speech signal, and a
score which represents the quality of the
match will be computed. Depending on how
well this "word verification" program
performs, it will be used either to augment
or replace the current lexical matching
programs.


To further develop our experience and
insight into how to perform lexical
retrieval, statistics gathering experiments
are being designed to evaluate the relative
reliability of different kinds of segments
and boundaries in the acoustic transcription
and, for each word in the vocabulary, the
relative reliability of detection of those
phonemes which one would expect to be
"robust" (e.g. stressed vowels and strident
fricatives).


One pressing problem is the need for a
more rigorous foundation for computing word
match scores. As we learn more about the
relative reliability of parts of the acoustic
transcription and about ways in which new
correlations between phonemic spellings and
acoustic features should be used to influence
word match scoring, we will be able to
improve our present (largely intuitive)
techniques.

The problems of dealing with larger
vocabularies (over 1000 words) and more
elaborate phonological knowledge are
imminent. One of our goals is to develop an
understanding of how our lexical retrieval
techniques should change as the system grows.

Abstract

Automatic speech understanding must
accommodate the fact that an entirely accurate
and precise acoustic transcription of speech
is unattainable. By applying knowledge about
the phonology, syntax, and semantics of a
language and the constraints imposed by a
task domain, much of the ambiguity in an
attainable transcription can be resolved.
This paper deals with how to control the
application of such knowledge. A control
framework is presented in which hypotheses
about the meaning of an utterance are
automatically formed and evaluated to arrive
at an acceptable interpretation of the
utterance. This design is currently
undergoing computer implementation as a part
of the BBN Speech Understanding System
(SPEECHLIS).

Introduction 

Speech understanding, whether done by
people or by computers, demands great funds
of knowledge. This is due to the inherent
imprecision, variability and complexity of
the acoustic signal into which all speech is
encoded [1]. For example, the encoding of a
word, spoken in running conversation, will be
affected by its environment (the words
surrounding it), its importance to the
message (stress and intonation), its speaking
rate, and its speaker. It is an apparent
circular dilemma of acoustic-phonetics that
one cannot precisely identify contextual
effects without first identifying the
context. However, we believe that by
applying other sources of knowledge to an
initially uncertain and imprecise acoustic
transcription, in order to hypothesize
possible higher contexts, this circularity
can be broken.

In this paper, we are concerned with the
problem of how to control the application of
various sources of knowledge to this problem.
A framework of concepts, data objects,
queues, and programs is presented in which
strategies for forming and evaluating
hypotheses about the meaning of an utterance
may be implemented and studied. One such
strategy is described, and an example of its
performance is given. A Speech Understanding
System (SPEECHLIS) being developed at Bolt
Beranek and Newman provides the environment
for this work, and derives much of its
structure from this control framework.
Though it is not our purpose here to discuss
in detail the design of SPEECHLIS, it will be
useful to the reader to know that it contains
several knowledge sources as components --
acoustic-phonetic, phonological, lexical,
semantic, syntactic, and pragmatic.

A listener does not just passively
accept speech: he actively uses all his
knowledge to structure uncertain and
incomplete cues from the acoustic signal into
a grammatical, meaningful and appropriate
utterance. Sources of knowledge which are
available to a listener include:

1) The acoustic-phonetic properties of the
language -- Knowledge of the
correspondence between physically varying
parameters of the speech signal and the
basic phonetic elements of the language
(phonemes).

2) The vocabulary -- One presumes that an
English utterance will consist of a
sequence of English words, interspersed,
perhaps, with pauses and non-speech
sounds. A vocabulary constrains the
possible sequences of speech sounds and
the set of words which might fit a
particular sequence.

3) Phonological rules -- These rules specify
allowable or characteristic variations in
the pronunciation of words or phonemes in
particular environments.

4) The syntactic structure of the language --
A sound sequence which is heard as the
word sequence "in other samples" will not
be heard as "in of a samples", since the
latter is ungrammatical.

5) The set of concepts and relations that are
meaningful to the listener -- A sound
sequence which is understood as "close the
doors" will not be understood as "close
the daws", since birds don't close.

6) Pragmatic considerations (knowledge of the
current context or situation) -- A similar
sound sequence may be heard as "close the
Dewar's" in a room in which the only thing
open is a bottle of scotch.

Much of the above knowledge is specific
to a problem domain. For our automatic
speech understanding effort at BBN, we have
chosen the task area of an existing natural
language question-answering system (LUNAR)
for the Apollo 11 moon rocks [4], which
answers questions such as:

How many breccias contain more than 10%
anorthosite?

In which samples was titanium found?

Give me all references to olivine
twinning.

In doing so, we have been able to draw
upon our knowledge of lunar geology and
question-answering system characteristics
developed during work on that system. The
LUNAR system operates with a lexicon of
around 3500 words. As of this writing,
SPEECHLIS is operating with a 250 word
lexicon, with a larger one of about 1500
words in preparation.

Overview  of  the  Control  Framework 
Data Objects

The control framework that we will
discuss assumes the existence of programs
which have access to various sources of
knowledge. For example, acoustic-phonetic
and phonological programs operate on a
digitized waveform to produce an acoustic
transcription of the utterance in the form of
a collection of SEGMENT descriptors. By a
segment we mean a portion of the utterance
which is hypothesized to be a single phoneme.
Each segment has a description which could in
principle specify the phonetic identity of
the segment, but in general merely constrains
this identity to one of several phonemes.
Alternative hypothesized segments may overlap
in the utterance, resulting in a lattice of
segment descriptors rather than a single
string. Figure 1 gives an example of such a
SEGMENT LATTICE. This structure allows for
the representation of uncertainty or
ambiguity both in the identity of a segment
and in the determination of the segment
boundaries.
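
One way to picture a segment lattice is as a set of
descriptors, each spanning two boundary positions and
carrying the phonemes it allows. The sketch below is an
illustration only: the names Segment and segments_from,
the integer boundary positions and the sample data are
assumptions and do not reproduce Figure 1 or the
SPEECHLIS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int            # left boundary position in the utterance
    end: int              # right boundary position
    phonemes: frozenset   # phonemes this segment may turn out to be


def segments_from(lattice, position):
    """Return every hypothesized segment that begins at `position`."""
    return [s for s in lattice if s.start == position]


# Illustrative data only (hypothetical, not taken from Figure 1).
lattice = [
    Segment(0, 1, frozenset({"G", "K"})),
    Segment(1, 2, frozenset({"IH", "IY"})),
    Segment(1, 3, frozenset({"IY"})),      # overlaps the two segments around it
    Segment(2, 3, frozenset({"B", "D"})),
]
```

Two alternatives start at position 1 here, so
segments_from(lattice, 1) returns both. This is how
uncertainty about segment boundaries, and not only about
segment identity, can be carried in the structure.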

Lexical retrieval and word matching
programs are available to map sequences of
segment descriptions into words. They do
this by matching PHONETIC SPELLINGS of the
words in the vocabulary against sequences of
adjacent segments. The correspondence
between a single phonetic spelling of a word
and a segment sequence is called a WORD
MATCH. Since the acoustic transcription may
make errors in the detection of segments,
word matches involving missing or extra
segments may also be made. The quality of
the match is one indication of the likelihood
that the word actually appears at that place
in the utterance. Word matches to be
examined by syntax, semantics and pragmatics
programs are entered into a WORD LATTICE.
(Such a lattice is illustrated in Figure 2.)
In this figure, for example, the word "mean",
spelled phonetically [min], or to use our
computer representation [M IY N], matches
from 2 to 5 in the lattice, while the word
"any", spelled [ɛni] or [EH N IY], matches
from 3 to 6.

Each phoneme in the above two spellings
satisfies exactly the phoneme description of
its corresponding segment. We do not assume
however that the correct phonemic identity of
a segment will always be among the set of
phonemes postulated by the acoustic-phonetic
and phonological programs. Rather we assume
that if they err, the correct phoneme will be
similar in acoustic characteristics to those
given. For example, at the beginning of the
segment lattice, the first two phonemes of
the word "give", spelled [gɪv] or [G IH V],
match the segment descriptors perfectly. The
third, [v], is sufficiently close to [b]
acoustically that a word match is made for
"give" and entered into the word lattice.
However, since the acoustic transcription is
the best evidence we have of what the
utterance was, our confidence in "give"
actually beginning the utterance is less than
if each of its phonemes had matched
perfectly.
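
The "give" example can be sketched as a match of a
phonetic spelling against a path of adjacent segments,
in which an acoustically similar phoneme is accepted at
a reduced score. Everything specific here is an
assumption made for illustration: the similarity pairs,
the 1.0 and 0.5 values, the multiplication of
per-phoneme scores, and the tuple form of the lattice.
Missing and extra segments are not handled.

```python
# Hypothetical similarity pairs; the paper does not list its own.
SIMILAR = {frozenset({"V", "B"}), frozenset({"IH", "IY"})}


def phoneme_score(phoneme, allowed):
    """1.0 for an exact hit, 0.5 for an acoustically similar one, else 0.0."""
    if phoneme in allowed:
        return 1.0
    if any(frozenset({phoneme, a}) in SIMILAR for a in allowed):
        return 0.5
    return 0.0


def match_word(spelling, lattice, start):
    """Yield (end, score) for each path of adjacent segments that matches.

    `lattice` is a list of (start, end, allowed_phonemes) tuples.
    """
    if not spelling:
        yield start, 1.0
        return
    for seg_start, seg_end, allowed in lattice:
        if seg_start != start:
            continue
        s = phoneme_score(spelling[0], allowed)
        if s == 0.0:
            continue          # this segment cannot be the phoneme at all
        for end, rest in match_word(spelling[1:], lattice, seg_end):
            yield end, s * rest


# Illustrative lattice: the third segment was transcribed as [B].
lattice = [(0, 1, {"G"}), (1, 2, {"IH"}), (2, 3, {"B"})]
matches = list(match_word(["G", "IH", "V"], lattice, 0))
```

With this data the only match for "give" ends at
position 3 with score 0.5, lower than the 1.0 a perfect
match would receive.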

Interacting  with  the  word  lattice,  the 
higher  level  components  of  the  system 
(syntax,  semantics  and  pragmatics)  form 
internal  data  objects  called  THEORIES 
representing  hypotheses  about  the  original 
utterance.  A  theory  contains  a 
non-overlapping  collection  of  word  matches 
which  are  postulated  to  be  in  the  utterance, 
together  with  syntactic,  semantic  and 
pragmatic  information  about  this  collection 
and  scores  representing  the  evaluations  of 
that  theory  by  various  knowledge  sources. 

Theories grow and change as additional
bits of evidence for or against them are
found. A principal mechanism for
accomplishing this is the creation of
MONITORS. A monitor is a trap set by a
hypothesis on new information which, if
found, would result in a change or extension
of the monitoring hypothesis. However, the
reprocessing that is called for when a
monitor is noticed is not done immediately.
Rather an EVENT is created, pointing to the
monitor and the new evidence. This event is
evaluated to decide if and when the
reprocessing should be done.

In addition to waiting for new
information (by setting monitors), the higher
level components can actively seek it out.
One way this is done is by PROPOSALS. A
proposal is a request to match a particular
word or set of words at some point in the
utterance; any of the higher level components
can make proposals.

A short example should illustrate the
above concepts more clearly. Notice the
robust word match for "chemical" in the word
lattice shown in Figure 2. The semantics
component knows about CHEMICAL ANALYSES and
CHEMICAL ELEMENTS, but not about CHEMICAL as
an independent concept. Since "chemical"
matches well, semantics might postulate that
one of these concepts is being designated.
It could propose "analysis", "analyses",
"determination" (all naming the first concept)
and "element", requesting them to be compared
against the segment lattice, right adjacent
to "chemical". Since "analyses" and
"analysis" match well, events would be
created, linking the hypothesis for
"chemical" with those for "analysis" and
"analyses". Given that CHEMICAL ANALYSIS
refers to the amount of each major element in
some rock, e.g. "chemical analyses of
fine-grained lunar rocks", any hypothesis
created for "chemical analyses" will monitor
for an instantiation of the concept ROCK. If
found, it will give additional support to the
theory that what is being discussed is indeed
the chemical analyses of some rock.
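
The monitor in this example can be sketched as a small
record that names the waiting theory and the concept it
is waiting for. When a new word match names that concept
an event is created, but nothing is reprocessed at that
moment. The class names, fields and the concepts_of
lookup below are assumptions introduced for this
illustration.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    theory: str      # the hypothesis that set the trap
    concept: str     # the concept it is waiting for, e.g. "ROCK"


@dataclass
class Event:
    monitor: Monitor
    evidence: str    # the newly found word match


def notice(monitors, word, concepts_of):
    """Create an event for every monitor whose concept the new word names.

    No theory is changed here; the events are only returned so that they
    can be scored and processed later, if at all.
    """
    return [Event(m, word) for m in monitors if m.concept in concepts_of(word)]


def concepts_of(word):
    # Hypothetical stand-in for the semantic component's dictionary.
    return {"ROCK"} if word in ("rock", "rocks") else set()


monitors = [Monitor(theory="chemical analyses", concept="ROCK")]
events = notice(monitors, "rock", concepts_of)
```

Here the word match for "rock" produces one event that
points back to the monitor set by "chemical analyses".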

Evaluation  Mechanisms 

A notion central to the control
framework is that of evaluation: one cannot
afford to spend time on activities unlikely
to produce good results. The various scores
associated with a theory are used by Control
to allocate its resources to where it expects
to achieve results. In this section, we
discuss how knowledge is brought to bear in
computing these scores.

The score of a word match depends on how
well each of the phonemes in the phonetic
spelling matches the corresponding sound
description in the segment lattice. Among
the factors taken into account in making this
match are such things as:

1) A priori information about the similarity
of sounds (e.g. [i] is more similar to
[I] than to [a]).

2)  Cues  from  comparing  the  actual  duration  of 
a  segment  with  duration  information 
derivable  from  the  phonetic  spelling  using 
vowel  tenseness  and  stress. 

3)  The  likelihood  of  missing  or  extra 
segments.  This  is  determined  both  from 
empirical  studies  of  the  segmentation 
errors  which  are  made  by  the 
acoustic-phonetic  programs  and  from 
phonological  rules  which  indicate  the 
sounds  in  each  phonetic  spelling  which  are 
likely  to  be  missing  or  extra. 

4)  The  length  of  the  word.  Long  words  which 
match  well  get  a  boost  in  score  because  it 
is relatively unlikely that good, long
word  matches  would  be  detected  at  random. 


The  score  of  a  theory  is  a  weighted  sum 
of  its  lexical,  syntactic,  semantic  and 
pragmatic  scores.  The  lexical  score  depends 
on  the  average  word  match  score  for  the  words 
in  that  theory,  the  number  of  adjacent  word 
matches,  and  acoustic  effects  at  their 
boundaries.  The  semantic  score  is  based  on 
an  evaluation  of  the  conceptual  structures 
that  semantics  has  built,  reflecting  whether 
they  are  complete  or  lack  some  obligatory 
component.  In  the  latter  case,  semantic 
confidence  in  the  theory  is  lowered. 
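
The weighted sum can be written out directly. The weights
below are placeholders chosen for the illustration; the
paper does not give the values used in SPEECHLIS, and the
0-to-1 score range is likewise an assumption.

```python
# Hypothetical weights; the actual values are not stated in the paper.
WEIGHTS = {"lexical": 0.4, "syntactic": 0.2, "semantic": 0.3, "pragmatic": 0.1}


def theory_score(scores, weights=WEIGHTS):
    """Combine the four component scores of a theory into one number."""
    return sum(weights[name] * scores[name] for name in weights)


complete = {"lexical": 0.8, "syntactic": 0.5, "semantic": 1.0, "pragmatic": 0.5}
# Same theory, but semantics finds an obligatory component missing.
incomplete = dict(complete, semantic=0.4)
```

Lowering the semantic score of an otherwise identical
theory lowers its overall score, which is the effect
described above for a conceptual structure that lacks an
obligatory component.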

The syntactic evaluation is based on the
ability to assign syntactic structure to the
hypothesis. Using an augmented transition
network grammar [3] and a parser capable of
working with disjoint sequences of word
matches, the syntactic component tries to
parse each such sequence and decide whether
sequences could be joined into a larger
syntactic structure. If a word match
sequence fails to parse, or if two nearby
sequences cannot be bridged in any way,
syntactic confidence in the hypothesis will
be low.

Currently, SPEECHLIS contains very
limited pragmatic knowledge: only the most
rudimentary speaker and context models are
available for use in evaluating a theory.
Observing the relationships postulated by
syntax and semantics, the pragmatic component
evaluates the likelihood of an utterance that
would contain them. For example, in the
context of question-answering, questions and
commands are more likely than statements, so
pragmatics looks for syntactic evidence of
sentence type in making its evaluation. The
question-answering context also makes certain
semantic concepts more likely than others.
For example, the concept of the machine
giving the user something or of the user
needing something is more likely to be
expressed than any particular concept, such
as that of spectrographic analysis. The
pragmatic component uses the conceptual
structures that semantics has built to
evaluate their likelihood of occurrence.
(This evaluation is currently user
independent, but we expect eventually to deal
with a dynamically developed model of the
user's interest.)

There is a further evaluation based on
the consistency of the semantic and syntactic
structures. Associated with each conceptual
structure that semantics has built is a
condensed description of the ways in which
that structure might be realized
syntactically. If none of the structures
that syntax can build correspond to these,
this discrepancy lowers the likelihood of the
theory actually representing part or all of
the original utterance.

An  event  is  evaluated  in  the  same  way  as 
a  theory:  that  is,  the  score  of  an  event  will 
reflect  the  score  of  the  suggested  new 
theory. 

A  Control  Strategy 

Within  the  framework  of  word  matches, 
theories,  evaluation  mechanisms,  etc.,  a 
preliminary  control  strategy  is  undergoing 
computer  implementation.  In  this  strategy, 
the  proposals,  theories  and  events  that  occur 
during  processing  are  evaluated  and  placed  on 
three  separate  queues,  ordered  by  the  scores 
of  their  elements.  The  basic  characteristic 
of  this  strategy  is  to  select  elements  from 
the  tops  of  these  queues  and  process  them. 
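
A score-ordered queue of this kind can be sketched with a
binary heap. The class below is an illustration under
stated assumptions: the name ScoredQueue is invented
here, higher scores are taken to be better, and ties are
broken by insertion order. The paper does not say how
its queues are implemented.

```python
import heapq
import itertools


class ScoredQueue:
    """Queue whose pop() returns the highest-scoring item."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()   # insertion order among equal scores

    def push(self, score, item):
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._tie), item))

    def pop(self):
        neg_score, _, item = heapq.heappop(self._heap)
        return -neg_score, item

    def __len__(self):
        return len(self._heap)


# The three separate queues named in the text.
proposals, theories, events = ScoredQueue(), ScoredQueue(), ScoredQueue()
```

A control loop would then repeatedly pop from whichever
queue it chooses to serve and push any new proposals,
theories or events that the processing creates.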

The first activity of the control
programs is to call the acoustic-phonetic and
phonological programs to construct an initial
segment lattice from the speech signal. A
word lattice of robust word-matches is then
constructed by a program which scans the
segment lattice with the aid of the
dictionary. In addition, a set of words
which are pragmatically likely to begin an
utterance are matched at the beginning of the
segment lattice. As each such word match is
found, it is entered into the word lattice
and given to the semantic component for
analysis. If the word has semantic content,
a theory is created for the word match,
designating all semantic contexts in which it
could appear. If a monitor is noticed
indicating that a word fits into the semantic
context of a theory which was created
earlier, an event is created which associates
the new word match with the old theory.
Proposals for specific content words which
are likely to appear adjacent to the new word
match are created and added to the proposals
queue.

For  each  new  word  match,  appropriate 
inflexional  endings  and  auxiliary  verbs  are 
matched  against  the  segment  lattice  and 
associated with the word match if they match
well. 

After the initial set of robust word
matches are examined, the proposals that are
likely to be productive are processed, thus
introducing new word matches and triggering a
new round of semantic analysis. The events
at the top of the event queue are then handed
back to the semantic component for further
processing. For each event, a new theory is
created with a modified semantic context and
entered into the theory queue. This may
result in additional events, as semantics
notices other word matches in the word
lattice which fit into the modified context.
In this way, semantics assembles meaningful
sets of content words.

As new theories are created, each is
examined to determine whether it might be
fruitful to call upon syntactic knowledge to
develop further support for it. Since the
number of possible parsings decreases with
the number of adjacent or "close" word
matches, this decision is made on the basis
of the number of adjacent word matches in the
theory, the size of the gaps between word
match sequences, and the absence of content
words in the word lattice which would be
added to the theory by semantics.

Syntactic knowledge is used to postulate
grammatical structures that may obtain among
the words in a theory. For example, for "...
people done chemical analyses ...", syntax
could suggest that "people" is the subject of
the verb "done", "chemical analyses" is the
noun-phrase object, and that an auxiliary
verb appears somewhere in the utterance
(probably at the beginning) to modify the
past participle "done". Such grammatical
information is checked for consistency with
the postulated semantic structures, to
determine for example whether it makes
semantic sense for "people" to do something.
Function words (e.g. determiners and
prepositions) which are likely to appear
adjacent to a sequence of word matches are
proposed by syntax in the context of these
grammatical structures, and added to the
theory as a refinement if they are found.
Each small gap between sequences of word
matches is analyzed, and a strong attempt is
made to find a small word which fits. If
none is found, it is likely that one of the
word matches adjacent to the gap is wrong.

An  Example 

To illustrate the operation of the above
control strategy, we will consider a specific
example. The segment lattice shown in Figure
1 was constructed by hand from a speech
spectrogram during a study of human
performance in spectrogram reading
experiments [2]. The word lattice shown
schematically in Figure 2 was constructed
from it by looking for robust word matches
and possible adjuncts (inflections and
auxiliaries) and by trying to match
pragmatically likely words in sentence
initial position.

Following this first pass in which word
matches were entered in the word lattice and
given to semantics for processing, there were
42 theories and 48 events. (Some pruning was
done to eliminate unlikely events.) The five
events at the top of the event queue were
ones linking "chemical" and "analyses",
"modal" and "analyses", "chemical" and
"analysis", "modal" and "analysis", and
"metal" and "analyses". (One can analyze a
rock for its metal content.)


Processing  these  five  events  led  to  the 
creation  of  five  new  theories  and  55  new 
events.  At  this  point,  the  best  events 
called  for  linking: 

1) "give" (initial position) and "chemical
analyses"

2) "give" (initial position) and "modal
analyses"

3) "give" (initial position) and "chemical
analysis"

4) "print" (initial position) and "chemical
analyses"

5) "have" (initial position), "done" and
"chemical analyses"


Notice  that  the  top  four  events  were  quite 
reasonable  though  incorrect.  Five  new 
theories  and  20  new  events  were  created 
during  this  round  of  processing. 


The next round of event processing
brought the following five events to the top
of the queue:

1) "have ... done chemical analyses" and
"people"

2) "have ... done chemical analyses" and
"rock"

3) "give ... chemical analyses" and "me"
(following "give")

4) "give ... chemical analyses" and "us"
(following "give")

5) "give ... chemical analyses" and "I"
(following "give")

Notice that the top two events were each
filling up a different semantic role in the
concept of doing a chemical analysis -- the
agent of the doing and the object of the
analysis. As to the "give I" event,
semantics does not know that this is
syntactically incorrect. Again five new
theories were created during this round, but
these resulted in only the five events shown
above.


At the start of the fourth round of
event processing, the five best events were:

1) "have ... people done chemical analyses"
and "rock"

2) "have ... done chemical analyses ... rock"
and "people"

3) "give me ... chemical analyses" and "rock"

4) "give us ... chemical analyses" and "rock"

5) "give I ... chemical analyses" and "rock"

Notice that the top two events would result
in the same theory. However, before a theory
is created, the control strategy checks that
no such theory already exists. If one does,
processing is halted on that event so that
duplication does not occur. (However, this
ability to arrive at the same theory from
several directions is necessary since it
allows us to put together incomplete
structures, regardless of which pieces are
missing.) The four resulting theories were
semantically complete: both agent and object
of "doing" had been identified, as had the
object of "chemical analyses", and agent,
recipient and object of "give". At this
point, semantics could not contribute
anything to these good theories, and they
were sent off to syntax.
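
The duplicate check can be sketched by giving each theory
an identity that does not depend on the order in which
its word matches were assembled. Representing a theory
by the set of its (word, start, end) matches is an
assumption made for this illustration, as are the
lattice positions used in the sample data.

```python
def theory_key(word_matches):
    """Order-independent identity of a theory: the set of its word matches."""
    return frozenset(word_matches)


class TheoryTable:
    def __init__(self):
        self._seen = set()

    def add(self, word_matches):
        """Return True if the theory is new, False if it already exists."""
        key = theory_key(word_matches)
        if key in self._seen:
            return False      # halt processing of this event
        self._seen.add(key)
        return True


# Hypothetical positions for the word matches in the example.
have, people = ("have", 0, 2), ("people", 5, 8)
done, analyses, rock = ("done", 8, 10), ("chemical analyses", 10, 18), ("rock", 22, 25)

table = TheoryTable()
first = table.add([have, people, done, analyses, rock])    # event 1: adds "rock"
second = table.add([have, done, analyses, rock, people])   # event 2: adds "people"
```

The first event creates the theory and the second is
recognized as leading to the same one, so only one copy
goes on to syntax.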

Syntax noticed the determiner "any" in
the word lattice which could precede "people"
syntactically, and it created an event which
would refine the first theory with the word
match for "any". In addition, syntax
proposed determiners before "rock", since
none occurred in the word lattice. This and
additional proposals brought word matches for
"this" and "on" into the word lattice. These
were added to the theory by syntax, resulting
in a semantically meaningful, grammatically
correct one which spanned the utterance.
This was, at the time, sufficient criteria
for accepting the theory "Have any people
done chemical analyses on this rock" as a
correct understanding of the utterance.

Conclusion 


Both  the  control  framework  and  strategy 
presented  above  are  incomplete  since  many 
problems  have  still  to  be  faced.  Our  most 
difficult  current  problem  involves 
recognizing  the  state  when  the  system  is  just 
thrashing  around,  when  no  theory  deriving 
from  our  current  strategies  is  going  to 
emerge  as  a  good  candidate  for  the  whole 
utterance.  We  need  to  use  our  knowledge 
sources  to  decide  which  pieces  of  existing 
theories  are  most  reliable,  and  which  pieces 
should  be  tossed  out.  To  get  a  better 
feeling  for  the  possibilities,  we  expect  to 
use  the  technique  of  "incremental  simulation" 
[5],  in  which  a  person  simulates  a  part  of 
the  system  which  is  not  yet  formulated  to 
gain  insight  into  how  it  might  work. 

Another  pressing  problem  is  the  need  for 
a  more  rigorous  foundation  for  measuring 
confidence  in  evidence  and  combining  such 
measures  into  measures  of  confidence  in 
theories  and  events.  As  complexity 
increases,  our  current  methods  will  become 
more  difficult  to  manage. 

Other problems will arise from our
imminent transition to a larger vocabulary
and projected transition into different task
domains. The attempt to solve all these
problems will test the adequacy of our
control framework in dealing with a world of
uncertain and incomplete information.

Introduction


This paper will address four questions:
(1) What makes the parsing of speech
significantly different from the parsing of
text? (2) What role does syntax play in
speech understanding? (3) What strengths and
weaknesses do existing methods of parsing
text have in light of the answer to the
previous questions? (4) How does the BBN
speech parser cope with the problems
presented?

Problems  in  Parsing  Speech 

Parsing speech is a much more difficult
problem than parsing text. Because speech is
continuous, word and sentence boundaries are
usually obscured. Also, inaccurate or hasty
articulation and the normal variation in the
pronunciation of phonemes cause the
pronunciation of a word in context to be very
different from that in isolation. Due to
contextual effects, in order to uniquely
identify a word in speech using acoustic
parameters it may be necessary to know the
words around it, but in order to identify
those words, their context, including the
original word, may be needed. The only way
to break this cycle of impossibility is to
allow considerable ambiguity in the word
identification process. Acoustic processing
results in uncertainty in the identification
of phonemes and, therefore, of words,
especially small function words such as
"the," "a," "of," "have," "did," etc. Even
if the acoustic component could identify
phonemes uniquely, some ambiguity would be
inevitable because of the occurrence of
homonyms, as in the sequence "wait/weight
four/for/fore the bare/bear," and because
word boundaries may be shifted, as in
"tea meeting/team eating/team meeting." In
text processing there is no such inherent
ambiguity, but any speech understanding
system must be able to deal with it.

The implication of all this for parsing
is that the input to a parser for speech
cannot be a string of uniquely determined
words but must be something like a lattice of
words (see Figure 1). When the parser wants
the "next word" of the input it must be able
to deal with a list of possible words and
must be prepared to cope with the possibility
that the right word is not included in that
list. It may also be the case that no usable
word can be found at one or more places in
the utterance, so the parser must also be
able to deal with gaps in its input.
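
A minimal sketch of what "next word" means for such an
input follows. It assumes a word lattice stored as
(word, start, end) tuples with integer positions and
marks a position where nothing matches with None; these
conventions are invented for the illustration. The
sketch does not cover the harder case in which
candidates exist but the right word is not among them.

```python
def next_words(word_lattice, position):
    """Return the word matches that start at `position` (possibly none)."""
    return [m for m in word_lattice if m[1] == position]


def sequences(word_lattice, position, end):
    """Enumerate word sequences from `position` to `end`; None marks a gap."""
    if position >= end:
        yield []
        return
    candidates = next_words(word_lattice, position)
    if not candidates:
        # No usable word here: record a one-position gap and move on.
        for rest in sequences(word_lattice, position + 1, end):
            yield [None] + rest
        return
    for word, _, word_end in candidates:
        for rest in sequences(word_lattice, word_end, end):
            yield [word] + rest


# Hypothetical positions for the shifted-boundary example in the text.
shifted = [("tea", 0, 1), ("team", 0, 2), ("meeting", 1, 3), ("eating", 2, 3)]
gappy = [("have", 0, 1), ("done", 2, 3)]
```

On the first lattice the parser is offered both "tea
meeting" and "team eating". On the second it receives
"have", a gap, and "done", and must decide what could
fill the gap.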
Figure 1. A partial word lattice

When processing text, a parser could
reasonably take advantage of a number of
extra-linguistic indicators such as
punctuation marks (a period to delimit a
sentence, commas to disambiguate certain
complex conjunction constructions, etc.),
capitalization (to indicate the start of a
sentence or to distinguish proper nouns such
as "Pat" from other words such as the verb
"pat"), italics, underlining, quotation
marks, and parentheses. (To illustrate the
importance of these factors to comprehension,
consider the following grammatical but
unpunctuated string: that which is is that
which is not is not is not that so). All of
these cues are missing in speech. They are
compensated for by the use of pauses, stress,
changes in duration, pitch, and loudness, and
other prosodic features. Unfortunately the
current lack of knowledge about the acoustic
correlates of prosodic features makes it
almost impossible to use this rich source of
information in speech understanding systems,
so current speech parsers must cope with the
increased ambiguity resulting from this lack
of information.

The  Purpose  of  Syntax 

In most systems which work with natural
language the purpose of the parser is to
provide a representation of the syntactic
units of the input and their relationships to
one another. This representation is
frequently a "deep structure" tree (as in
Figure 2) which may then undergo semantic
analysis or interpretation. The creation of
a self-contained syntactic structure is not
absolutely mandatory if enough semantic and
interpretive processing is done together with
the parsing, but in any case the syntactic
component must be able to confirm that the
input is grammatically correct, and we will
assume that some structure for it is also
produced. A parser for speech, however, must
do more than this. In addition to detecting
syntactic ambiguities (e.g. "I gave her cat
food.") syntax must aid in selecting a
syntactically well-formed sequence of words
from the many sequences of words which are
possible in the word lattice.

Figure 2. A deep structure for
"Do  we  have  samples  which  contain  silicon?" 

Text parsers are designed on the
assumption that the words given as input will
in fact form a grammatical sentence, so the
duty of the parser is merely to determine the
structure(s) of the sentence. A speech
parser, however, must know that some (in
fact, many) of its potential input sequences
will be ungrammatical, and it must be able to
detect and reject those sequences as early as
possible.

Another  goal  of  any  speech  parser  must 
be  to  predict  words  or  syntactic  categories 
which  could  fill  gaps  in  the  word  lattice. 
The  type  of  predictions  which  can  be  made 
depends  on  the  nature  of  the  grammar  being 
used  and  the  amount  of  context  which  is  taken 
into  account  when  making  the  predictions. 

Existing  Models 

Assuming  that  the  extensive  body  of  work 
which  has  been  done  in  the  analysis  of  text 
has  something  to  offer  for  the  analysis  of 
speech,  let  us  examine  two  of  the  techniques 
which  have  been  used.  For  a  more  complete 
description  of  these  methods  see  the  book  by 
Aho and Ullman [1].

Top down methods of parsing (so called
because they construct the deep structure
tree by beginning at the root node and
working down) are left-to-right and usually
predictive; they begin by searching for a
component of a given type and operate
recursively, trying all possible ways of
building the constituent before failing. The
ability of this method to predict, at any
point, the set of acceptable constructions
which could appear in the input as a function
of the context to the left is its strongest
advantage. In speech analysis, the
predictions may be used to eliminate some of
the possible "next words" in the word
lattice. This method has the disadvantage
that if there is an error at or near the
beginning of the input, the parser may not
only take a long time to fail but will
consider the last portion of the string only
in the context of the earlier (erroneous)
part, thus little if any useful information
may be gained about the structure of the last
part of the input. Unless great care is
taken to prevent duplication of effort when
re-parsing portions of the input (by the use
of a well-formed-substring table or by
compacting methods such as Earley's algorithm
[1, p. 320; 2]), the lexical ambiguity of speech
input could cause an exponential increase in
the amount of work required.

Bottom up techniques such as Cocke's
algorithm [1, p. 314] begin with the leaves of
an analysis tree and work up. First, all
possible substrings of length one are
considered and all one-word constituents
formed. Then using this information all
pairs of adjacent words are considered and
all two-word constituents are formed. Then
all adjacent three-, four-, five-, ... word
substrings are considered until the length of
the string is reached. This method is
neither left-to-right nor right-to-left and
has the advantage of working with isolated
sections of the input so that an error at one
point will not prevent a correct analysis of
another portion of the string. It
unfortunately requires that all possible
parsings of all sections of the input be
found in parallel -- a procedure which is
enormously wasteful of space and time even
when a single string is being processed. The
multiple words produced by an acoustic
analyzer and lexical retriever together with
the multiple syntactic categories for many of
those words and the multiple ways they can be
syntactically combined when only very local
context is used exacerbate the problem to
such an extent that a totally bottom up
speech parser would be unthinkably slow.
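
The length-by-length procedure just described can be
sketched as a chart filled from short substrings to long
ones. The version below assumes a grammar in Chomsky
normal form and a single string of words as input; the
function name, the dictionary encoding of the grammar
and the toy grammar itself are assumptions made for the
illustration.

```python
def cocke_parse(words, lexicon, rules):
    """Bottom-up chart parse for a grammar in Chomsky normal form.

    lexicon maps a word to the set of categories it may have;
    rules maps a pair (B, C) to the set of categories A with A -> B C.
    chart[(i, j)] holds the categories that span words[i:j].
    """
    n = len(words)
    chart = {}
    for i, w in enumerate(words):                # substrings of length one
        chart[(i, i + 1)] = set(lexicon.get(w, ()))
    for length in range(2, n + 1):               # then length two, three, ...
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):            # every split point
                for b in chart[(i, k)]:
                    for c in chart[(k, j)]:
                        cell |= rules.get((b, c), set())
            chart[(i, j)] = cell
    return chart


# Toy grammar, invented for this example.
lexicon = {"people": {"NP"}, "rocks": {"NP"}, "analyze": {"V"}}
rules = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

good = cocke_parse(["people", "analyze", "rocks"], lexicon, rules)
bad = cocke_parse(["xx", "analyze", "rocks"], lexicon, rules)
```

In the second call the unknown first word prevents a
sentence from being found, yet the verb phrase over the
last two words is still built. This shows both the
advantage named above and the cost: every cell of the
chart is filled whether or not it contributes to a
final analysis.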

What  is  needed  is  a  scheme  which  can 
merge  top  down  techniques  with  bottom  up  ones 
to  conbine  directed,  predictive  analysis  with 
immunity  to  errors  in  non-local  context.  The 
formalism  of  a  TRANSITION  NETWORK  GRAMMAR 
(TNG)  seems  particularly  well  suited  to  such 
adaptation,  for  the  following  reasons.  TNG's 
allow  easy  prediction  to  both  the  right  and 
left  of  any  word  of  input.  They  are 
constructed  in  such  a  way  that  ambiguous 
information  is  separated  only  in  the  truly 
ambiguous  part,  allowing  merging  of  the  rest 
of  the  analysis.  Some  relief  from  contextual 
errors  can  be  gained  by  limiting  the  context 
of  any  word  in  the  input  to  only  those  words 
which  may  be  in  the  same  constituent. 
Finally, although TNG's were designed to
drive  a  parser  in  top  down  mode,  bottom  up 
information  is  easily  accessible. 

For  a  complete  description  of  TNG's  and 
a text parser using them, see [4] and [5].
Briefly,  a  TNG  looks  something  like  a  finite 
state  network,  with  two  important  additions. 
The  network  may  be  recursive,  that  is,  the 
label  on  some  arc  may  call  for  a  structure 
created  by  recursively  re-applying  the 
network.  Second,  there  may  be  a  list  of 
ACTIONS  on  each  arc  whose  purpose  is  to 
perform  tests  or  to  create  bits  of  tree 
structure  and  store  them  in  REGISTERS  which 
may  be  thought  of  as  free  variables  whose 
values  are  accessible  to  subsequent  arcs.  In 
this  manner,  register  contents  can  be 
combined  and  built  up  to  finally  produce  a 
deep structure analysis of the sentence.

Figure 3 shows a diagram of a simple
TNG. The names of the states are within the
circles. The types of arcs shown are: CAT X,
which looks at the string for a word of
syntactic category X; JUMP, which moves to
another state without going on to the next
word of input; PUSH X, which calls the
network recursively beginning at state X; and
POP, which indicates the end of processing
the current level and specifies a schema for
building a piece of tree structure from the
contents of the registers.

The actions on the arcs are: (SETR X Y),
which replaces the contents of register X by
the value of Y; (ADDR X Y), which adds the
value of Y to the contents of register X
without destroying the old value; (GETF X),
which returns the value of the syntactic
feature X associated with the current word;
and (ABORTIF (NOT (DETAGREE))), which blocks
the arc if the determiner does not agree with
the head noun of a noun phrase (as in "a
rocks"). Other actions not shown in the
example  can  access  previous  register  contents 
and  test  arbitrary  predicates  in  order  to 
perform  some  actions  conditionally.  The 
abort  option  is  particularly  useful  for 
detecting errors in the input and blocking
the  analysis. 

The  symbol  *  is  used  to  refer  to  the 
current  word  of  input,  or,  on  a  PUSH  arc,  to 
the  tree  structure  returned  by  the  recursive 
call.  When  operated  as  a  text  parser,  the 
TNG  mechanism  is  top  down. 
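
As a rough illustration of the four arc types and of registers, the following sketch runs a tiny transition network top down over text. The lexicon, the grammar (which is not the grammar of Figure 3), the register names, and the function names are all invented for this example; only an ADDR-like action is modelled.

```python
# Hypothetical lexicon and grammar, invented for illustration.
LEXICON = {"the": {"DET"}, "old": {"ADJ"}, "samples": {"N"},
           "silicon": {"N"}, "contain": {"V"}}

# Each arc: (type, label, next state, register receiving the result).
GRAMMAR = {
    "S/":     [("PUSH", "NP/", "S/NP", "SUBJ")],
    "S/NP":   [("CAT", "V", "S/V", "V")],
    "S/V":    [("PUSH", "NP/", "S/OBJ", "OBJ"), ("POP", None, None, None)],
    "S/OBJ":  [("POP", None, None, None)],
    "NP/":    [("CAT", "DET", "NP/DET", "DET"), ("JUMP", None, "NP/DET", None)],
    "NP/DET": [("CAT", "ADJ", "NP/DET", "ADJS"), ("CAT", "N", "NP/N", "N")],
    "NP/N":   [("POP", None, None, None)],
}


def addr(regs, name, value):
    """ADDR-like action: add a value without destroying the old contents."""
    new = dict(regs)
    new[name] = new.get(name, []) + [value]
    return new


def parse(state, words, pos=0, regs=None):
    """Return every (tree, next position) reachable from `state`, top down."""
    regs = regs or {}
    results = []
    for kind, label, target, reg in GRAMMAR[state]:
        if kind == "CAT":        # consume one word of category `label`
            if pos < len(words) and label in LEXICON.get(words[pos], ()):
                results += parse(target, words, pos + 1, addr(regs, reg, words[pos]))
        elif kind == "JUMP":     # change state without consuming a word
            results += parse(target, words, pos, regs)
        elif kind == "PUSH":     # recursive call; the returned tree plays the role of *
            for tree, after in parse(label, words, pos):
                results += parse(target, words, after, addr(regs, reg, tree))
        elif kind == "POP":      # end of this level: build a tree from the registers
            results.append(((state.split("/")[0], regs), pos))
    return results


words = "the old samples contain silicon".split()
complete = [tree for tree, end in parse("S/", words) if end == len(words)]
print(complete[0])
```

The sketch returns all analyses rather than backing up through stored alternatives, which keeps it short at the cost of exploring every path.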

The  BBN  Speech  Parser 

The syntactic component of BBN's speech
system is one of a number of processes which
work together to understand an utterance.
For an overview of the entire system, see
Woods' paper in this volume [6]. Very
briefly,  the  structure  of  the  system  may  be 
described  as  follows.  There  are  a  number  of 
components  (Acoustics,  Lexical  Retrieval, 
Syntax,  Semantics,  Pragmatics,  and  Control) 
which  are  called  into  action  under  the 
direction  of  the  control  component. 
Acoustic,  phonological,  and  lexical  processes 
produce from the acoustic signal a lattice of
word matches for words with a high lexical
score, similar to that in Figure 1. Only
words of more than three phonemes are placed
in  the  lattice  initially  since  smaller  words 
tend  to  match  well  everywhere  and  flood  the 
lattice. 

The  semantic  component  selects  subsets 
of  this  lattice  based  on  semantic 
relationships  among  the  words.  Such  a  subset 
(in  the  form  of  a  word  match  list)  is 
associated  with  semantic,  pragmatic  and 
(initially  empty)  syntactic  information  and 
is  termed  a  THEORY.  It  is  a  hypothesis  about 
the content of the utterance. For the
remainder  of  this  paper,  the  term  "theory" 
will  be  used  to  refer  to  the  word  match  list 
alone  as  well  as  to  the  larger  structure  of 
which  it  is  a  part. 

When  a  theory  has  been  constructed  to 
which  Semantics  can  add  no  more  words,  it  may 
be  sent  to  Syntax  for  processing.  The 
initial input to the parser, then, is a list
of word matches. This list will probably not
span the utterance; there will be islands of
word matches with gaps between them. Each
word match may represent either a single word
with definite boundaries, a single word with
"fuzzy" boundaries, a word together with
possible inflectional endings, a group of
words which have the same semantic
associations, or a combination of any of the
above. Using brackets to delimit word
matches and numbers to indicate the
boundaries in the word lattice, a typical
theory for the utterance "List all the [...]

[...] process, it processes the islands of words in
it  from  left  to  right  and  attempts  to  create 
for  each  island  of  words  the  PATHS  (sequences 
of  TRANSITIONS  and  CONFIGURATIONS,  defined 
below)  which  represent  the  ways  in  which  the 
island  of  words  might  be  accepted  by  the 
grammar  if  surrounded  by  some  suitable 
context.  Then  Syntax  tries  to  extend  the 
theory  by  finding  (in  the  word  lattice)  or 
predicting  words  or  syntactic  classes  which 
would  provide  a  context  consistent  with  its 
analyses.  When  Syntax  has  finished 
processing  a  theory,  it  adds  to  the  syntactic 
part  of  the  theory  the  configurations  and 
transitions  used  in  its  analysis  and  returns 
to  Control  a  score  which  is  a  measure  of  the 
amount of syntactic information gained by the
analysis.

Each  configuration  represents  a  state  of 
the grammar which the parser could be in at a
particular  boundary  point  in  the  current 
theory.  Each  transition  represents  a  change 
from  one  configuration  to  another  by 
following  an  arc  of  the  grammar.  A 
transition  contains  information  about  the  arc 
which it represents, the word or words used
by  the  transition  and  the  possible  register 
contents  resulting  from  execution  of  the 
actions  on  the  specified  arc.  Since  a  given 
transition  may  have  any  number  of  transitions 
to  its  left  (because  different  contexts  may 
precede it), and since the actions on an arc
frequently  make  use  of  the  context  to  the 
left  by  looking  at  register  sets,  there  may 
be  a  number  of  sets  of  different  register 
contents associated with the transition,
although not necessarily one for each
possible context because sharing of register
information reduces the number of sets
required, as we shall see below.

Figure 4. Map of transitions and configurations

Syntax can create data objects called
MONITORS, EVENTS, and PROPOSALS which
represent instructions to Control. A monitor
is a demon which is placed on a particular
point in the word lattice. The monitor's job
is to watch for a word possessing some
specific characteristic (such as a particular
part of speech) to be placed in the lattice
at that point. If and when a monitor is
activated, it creates an event, which is a
record of the word which caused the event,
the theory which caused the monitor to be
set, and an instruction indicating which
component  to  call  to  process  the  event.  When 
an  event  is  processed,  a  new  theory  is 
created  from  the  old  one  by  including  the  new 
word.  Syntax  can  create  events  directly 
whenever  it  notices  a  word  already  in  the 
word  lattice  which  could  be  used  to  extend 
the  theory  it  is  processing.  Monitors  are 
passive  in  the  sense  that  they  merely  wait 
for  a  word  which  can  activate  them  to  appear. 
They  do  nothing  to  cause  such  a  word  to  be 
found.  A  proposal,  on  the  other  hand,  is,  as 
far  as  Syntax  is  concerned,  a  command  which 
causes  Control  to  activate  the  word  match 
component  to  look  specifically  for  a 
particular  word  or  syntactic  category  (whose 
members  are  enumerated)  at  a  particular  place 
in  the  word  lattice.  If  a  word  is  found,  the 
corresponding  monitor  will  be  activated  and 
an  event  created. 
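
The difference between a passive monitor and an active proposal can be sketched as follows. This is a toy model under stated assumptions: the class names, the single-position lattice indexing, the tuple used for an event, and the `matches` callback standing in for the word match component are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    position: int     # point in the word lattice being watched
    category: str     # characteristic watched for, here a part of speech
    theory: str       # theory that caused the monitor to be set
    component: str    # component to call when the event is processed


class WordLattice:
    def __init__(self):
        self.words = []      # (word, category, position) entries
        self.monitors = []
        self.events = []

    def add_word(self, word, category, position):
        """Place a word in the lattice; any matching monitor creates an event."""
        self.words.append((word, category, position))
        for m in self.monitors:
            if (m.position, m.category) == (position, category):
                self.events.append((word, m.theory, m.component))

    def propose(self, category, position, members, matches):
        """Proposal: actively try each enumerated member of a category."""
        for word in members:
            if matches(word, position):
                self.add_word(word, category, position)


lattice = WordLattice()
lattice.monitors.append(Monitor(7, "DET", "theory-1", "Syntax"))
lattice.add_word("old", "ADJ", 7)                  # no monitor for ADJ: nothing happens
lattice.propose("DET", 7, ["the", "a"], lambda w, p: w == "the")
print(lattice.events)                              # [('the', 'theory-1', 'Syntax')]
```

The monitor does nothing until a word arrives; the proposal is what causes the search, and the event is still created through the monitor.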


Working  through  a  small  example  should 
help  to  explain  the  features  of  the  parser 
and  the  data  structures  it  builds.  Consider 
the  theory  which  was  shown  above.  Figure  4 
shows  a  map  of  some  of  the  configurations 
(boxes)  and  transitions  (arrows)  which  exist 
after the second island of the theory
("sample(s)") has been analyzed. The
transitions  are  numbered  in  order  of  their 
creation  and  show  the  arc  they  represent  and 
the  sets  of  associated  register  contents. 
Let  us  assume  that  the  semantic  component  had 
attached to the theory the constraint that
"sample(s)" be used as a noun, not as a verb
or  as  an  adjective  ("(he)  samples  the  rocks," 
"(the)  sample  number").  Using  this  semantic 
restriction  together  with  an  appropriate 
index  for  the  arcs  of  the  grammar  (refer  to 
Figure  3),  the  parser  can  determine  that  the 
first  CAT  N  arc  from  state  NP/DET  must  be 
used to process the word "sample(s)" since
the  other  CAT  N  arc  actually  uses  the  word  as 
an adjective. In general there may not be
semantic constraints on how the first word of
an island can be syntactically realized, so
all arcs would be found which could process
the word as any of its possible parts of
speech. Thus the parsing is begun in a
bottom up mode.


Considering  the  plural  possibility 
first,  a  transition  is  made  from  a 
configuration  for  state  NP/DET  at  position  7 
to  a  configuration  for  state  NP/N  at 
position  13,  and  the  registers  N  and  NU  are 
set  by  the  actions  on  the  arc.  The  singular 
case  is  "fuzzy"  since  the  end  position  can  be 
either  12  or  13,  but  the  register  contents 
will  be  the  same  in  either  case.  Instead  of 
creating two transitions with duplicate
information,  one  transition  (number  2)  is 
created  with  multiple  terminations.  Multiple 
initial  configurations  are  also  permitted. 
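
One possible way to represent these objects is sketched here. The field names and the register values ("sample", "SG") are assumptions made for illustration; the text states only that registers N and NU are set and that a transition may have several initial and terminal configurations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Configuration:
    state: str        # grammar state, e.g. "NP/DET"
    position: int     # boundary point in the word lattice


@dataclass
class Transition:
    arc: str                                   # arc of the grammar being followed
    words: tuple                               # word or words used
    starts: set                                # initial configurations
    ends: set                                  # terminal configurations
    register_sets: list = field(default_factory=list)


# The singular reading: one transition, two possible end positions.
singular = Transition(
    arc="CAT N",
    words=("sample",),
    starts={Configuration("NP/DET", 7)},
    ends={Configuration("NP/N", 12), Configuration("NP/N", 13)},
    register_sets=[{"N": "sample", "NU": "SG"}],   # register values are assumed
)
print(len(singular.ends))    # 2
```

Making configurations immutable and hashable means that two requests for the same state and position compare equal, so they can be stored once and shared.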

Now  consider  what  could  occur  to  the 
left  of  the  island.  Reference  to  the  grammar 
shows  that  in  order  to  get  to  state  NP/DET 
the parser must take either the JUMP arc from
NP/  or  one  of  the  CAT  ADJ,  CAT  N,  or  CAT  DET 
arcs.  A  transition  for  the  JUMP  arc  can  be 
created  immediately  since  it  needs  no 
context.  The  word  lattice  is  checked  for  the 
existence of a word of category ADJ, N, or
DET and if one is found an event relating it
to the current theory is created. Whether or
not such a word is found, monitors are set to
watch  the  word  lattice  for  an  occurrence  of  a 
noun,  adjective,  or  determiner  at  some  later 
time.  Syntax  remembers  the  arcs  which  caused 
the  monitors  to  be  set  and  the  configuration 
at  that  point  (indicated  by  the  dotted  arrows 
in  Figure  4)  in  order  to  be  able  to  process 
an  event  should  one  occur. 

PUSH  arcs,  when  encountered,  cause  an 
internal  syntactic  monitor  to  be  set  at  a 
position  in  the  parser's 
well-formed-substring  table  (WFST)  where  all 
constituents  are  placed  when  they  are 
created.  The  PUSH  arc  also  causes  creation 
of  a  configuration  for  the  state  PUSHed  to  in 
order  to  begin  processing  for  the 
constituent. 

Going  back  to  our  example,  we  have  left 
open two configurations (NP/N at 12 and NP/N
at  13)  which  may  be  considered  for  extension. 
Currently  all  open  configurations  are 
processed,  but  this  results  in  many  partial 
paths  through  the  island.  Actually  they 
should  be  ordered  according  to  the  goodness 
of  the  paths  which  terminate  on  them.  We  are 
currently  working  on  a  formula  for 
calculating  a  score  for  a  path,  based  on  such 
things  as  the  number  of  registers  set  with 
unknown  values,  the  length  of  the  path,  and 
perhaps  even  the  lexical  score  of  the  words 
used.  By  trying  to  continue  only  the 
best-looking  paths  (but  remembering  the 
others)  we  cut  down  the  number  of 
possibilities  which  the  parser  must  explore. 
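
Since the scoring formula is still being worked out, the following is only a sketch of how ordered extension might look. The weights in `path_score` are invented, and the `Agenda` class is a hypothetical name for a priority queue of open configurations.

```python
import heapq


def path_score(unknown_registers, length, lexical_score):
    """Hypothetical scoring: reward length and lexical score, penalize unknowns."""
    return lexical_score + 0.5 * length - 1.0 * unknown_registers


class Agenda:
    """Open configurations ordered by the score of the best path reaching them."""

    def __init__(self):
        self._heap = []
        self._counter = 0          # tie-breaker so configurations are never compared

    def add(self, score, configuration):
        heapq.heappush(self._heap, (-score, self._counter, configuration))
        self._counter += 1

    def pop_best(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


agenda = Agenda()
agenda.add(path_score(1, 2, 0.9), ("NP/N", 12))
agenda.add(path_score(0, 2, 0.8), ("NP/N", 13))
print(agenda.pop_best())     # ('NP/N', 13): no unknown registers
print(len(agenda))           # 1: the other path is remembered, not discarded
```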

When  a  configuration  is  to  be  extended, 
the  arcs  from  its  state  are  tried  one  at  a 
time  in  top  down  fashion.  If  the  end  of  the 
island  has  been  reached,  arcs  which  require 
context to the right of the island cause
creation  of  events,  monitors,  and  proposals 
just  as  they  did  on  the  left.  In  our 
example,  this  point  is  reached  after  the 
creation  of  transition  4  and  the  setting  of 
monitors for prepositional phrases and
prepositions.  Whenever  a  path  becomes 
blocked,  a  simple  backup  procedure  is  invoked 
to  go  back  one  step  of  the  path  and  try 
another  of  the  alternatives  stored  there. 

Although  this  part  of  the  parser  is 
basically  top  down,  it  can  be  restricted  by 
bottom  up  information.  For  example,  whenever 
a  word  in  an  island  is  processed  which 
Semantics  has  hypothesized  must  be  used  in  a 
certain  syntactic  way,  only  the  arcs  of  the 
grammar  consistent  with  that  hypothesis  are 
allowed  to  extend  the  path  through  that  word. 

The  rest  of  Figure  4  shows  the 
transitions  and  additions  to  the  POP 
transitions  which  would  be  created  for  two 
events,  one  for  the  two  determiners  "the"  and 
"a"  and  then  one  for  the  adjective  "old". 
Notice  that  a  register  may  contain  a  set  of 
alternative  contents,  and  that  one  or  more  of 
these alternatives may be selected for use by
a  later  action.  The  test  on  the  POP  arc 
checks  agreement  between  determiner  and  head 
noun  and  prevents  noun  phrases  for  "sample," 
"old  sample,"  and  "a  samples"  from  being 
created.

There are several features of the parser
which  the  above  example  does  not  show.  For 
example,  if  an  action  on  an  arc  requires  the 
contents  of  a  register  which  is  not  set,  a 
special  symbol  which  means  "unknown"  is  used 
in  place  of  the  desired  value.  POP 
transitions  are  prohibited  from  building 
structures  containing  an  unknown  value.  When 
a transition is made which joins to the left
end of a path which contains incomplete
registers, new register sets are added to
relevant transitions with copies of the
unknown registers replaced by the correct
values, and any pending POP transitions which
have  their  register  lists  completed  by  this 
procedure  may  construct  their  results. 
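
The handling of unknown values can be illustrated with a small sketch. The names `UNKNOWN`, `can_pop`, and `complete` are hypothetical, and registers are modelled as plain dictionaries for brevity.

```python
UNKNOWN = object()     # stands in for the special "unknown" symbol


def can_pop(registers):
    """A POP may build structure only if no register holds UNKNOWN."""
    return all(value is not UNKNOWN for value in registers.values())


def complete(registers, left_context):
    """Copy a register set, replacing unknowns with values from the left."""
    return {name: left_context.get(name, value) if value is UNKNOWN else value
            for name, value in registers.items()}


pending = {"DET": UNKNOWN, "N": "samples"}
print(can_pop(pending))                      # False
filled = complete(pending, {"DET": "the"})
print(can_pop(filled))                       # True
```

The original register set is left untouched, so other contexts that join the same path later can produce their own completed copies.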

A  feature  currently  being  designed  for 
the  parser  will  allow  an  action  on  any  arc  to 
be  a  call  to  Semantics  to  test  the  contents 
of  various  registers  in  order  to  determine 
whether  or  not  that  particular  path  appears 
to  be  semantically  likely.  For  example,  if 
the sequence "green zebra" is being processed
with "green" as an adjective and the parser
is considering the arc which would take
"zebra" as the head noun, Semantics could be
asked  to  determine  how  well  the  adjective 
fits  the  noun.  Since  the  answer  would  be 
"not  well  at  all,"  the  parser  could  take  this 
as  an  indication  to  lower  the  score  for  that 
path  and  try  another  possibility,  such  as  the 
arc which would accept "zebra" as an
adjective  and  look  for  another  noun 
(e.g.  "cage")  to  follow  it. 

Semantic  guidance  could  be  used  to 
answer such questions as: "Given that a
particular  prepositional  phrase  has  been 
found  in  the  WFST  and  can  be  used  to  modify  a 
particular  noun,  would  the  result  be 
semantically  meaningful?"  or  "A  verb  is  about 
to  be  parsed,  and  the  subject  of  the  sentence 
is  known.  Could  the  noun  phrase  in  the 
subject  register  actually  serve  as  a  subject 
of the verb?" Even pragmatic guidance could
be  used  in  a  similar  way  ("Is  it 
pragmatically  likely  that  this  verb  is 
passivized?"),  if  it  were  known  how  to 
structure  more  pragmatic  knowledge  in  a 
usable  way. 

Figure  4  shows  part  of  the  data  base 
constructed  for  one  theory  only.  As  other 
theories  are  processed,  they  add  to  the  same 
data  base  and  may  use  the  information  already 
there.  Thus,  syntactic  information  may  be 
shared  across  theories.  This  is  especially 
important  for  the  WFST,  since  once  a 
constituent  is  placed  there  it  is  available 
to  all  other  theories  without  re-parsing. 
Even  partial  paths  may  be  shared,  since  once 
a  configuration  or  transition  has  been 
created  it  is  never  duplicated  but  merely 
included  in  the  syntactic  part  of  any  theory 
which  can  use  it. 
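
Sharing through the WFST amounts to looking a constituent up before parsing it. A minimal sketch, assuming constituents are keyed by category and lattice positions (the class name, the key layout, and the positions used are illustrative):

```python
class SharedSyntax:
    """One data base used by every theory."""

    def __init__(self):
        self.wfst = {}           # (category, start, end) -> constituent
        self.builds = 0          # how many constituents were actually parsed

    def constituent(self, category, start, end, build):
        key = (category, start, end)
        if key not in self.wfst:
            self.wfst[key] = build()
            self.builds += 1
        return self.wfst[key]


shared = SharedSyntax()
make_np = lambda: ("NP", "the", "samples")
first = shared.constituent("NP", 5, 13, make_np)    # theory A parses it
second = shared.constituent("NP", 5, 13, make_np)   # theory B reuses it
print(shared.builds, first is second)               # 1 True
```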


Conclusion 

We  have  tried  to  show  that  one  of  the 
major problems facing a parser for speech is
the lexical ambiguity of its input. The
combinatorial possibilities induced by this
ambiguity make straightforward applications
of previous parsing techniques too lengthy
and  complex  to  consider. 


We  have  attempted  to  reduce  the 
combinatorial  problem  by  the  following 
methods:  semantic  and  pragmatic  pre-selection 
of small subsets of the total word lattice;
the  use  of  semantic  guidance  during  parsing; 
a  basically  top  down  parsing  algorithm  with 
backup  capabilities  so  that  not  all  paths 
need  be  followed  in  parallel;  a  mechanism  to 
allow  ordering  of  the  paths  so  that  only  the 
best  are  processed;  merging  of  register 
information when possible; use of the WFST to
avoid  re-parsing  constituents  which  have 
already  been  found;  and  sharing  syntactic 
information  among  theories  to  avoid 
re-parsing  wherever  possible. 

That  these  methods  do  substantially 
reduce  the  work  required  can  be  shown  by  an 
example which has been parsed by the system.
The  utterance  was  "How  many  samples  contain 
silicon?"  and  the  word  lattice  contained  all 
the correct words as well as "give" in the
same  place  as  "how"  and  "any"  in  the  same 
place  as  "many."  Using  a  grammar  of  43  states 
and 102 arcs, beginning with a theory for
"sample(s) contain silicon," and processing
an  event  for  each  of  the  other  four  words,  it 
is  estimated  that  a  parser  without  the 
ability  to  share  transitions  and 
configurations  among  several  theories, 
without  backup,  and  without  the  WFST  would 
create  about  300  configurations  and  nearly 
500 transitions. The BBN speech parser
actually constructed a total of 104
configurations  and  142  transitions.  The 
parser  was  operating  without  semantic 
guidance  or  merged  register  information  — 
with  these  features  a  reduction  in  the  number 
of  transitions  and  configurations  of  about 
one  third  could  be  expected  for  this  example. 
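
The size of the saving follows directly from the figures quoted above. The short calculation below treats "about 300" and "nearly 500" as 300 and 500, which is an approximation; the function name is illustrative.

```python
def reduction(estimated, actual):
    """Fraction of work avoided relative to the estimate."""
    return 1 - actual / estimated


print(round(reduction(300, 104), 2))    # 0.65 for configurations
print(round(reduction(500, 142), 2))    # 0.72 for transitions

# A further reduction of about one third, as projected in the text.
print(round(104 * 2 / 3), round(142 * 2 / 3))    # 69 95
```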

Much  more  work  remains  to  be  done, 
particularly  in  the  areas  of  semantic 
guidance  and  the  inclusion  of  prosodic 
information,  but  we  have  established  a 
framework  which  will  allow  for  considerable 
experimentation  with  various  strategies.  We 
expect  the  system  to  serve  as  a  tool  to  help 
us  learn  about  the  relationship  between 
syntactic  knowledge  and  the  understanding  of 
natural language.

One function of the Semantics component
of SPEECHLIS, the BBN Speech Understanding
System, is to gather evidence for hypotheses
it has made regarding the content of an
utterance, as well as to evaluate the
hypotheses made by other components. Another
is to produce a representation of the
utterance's meaning. Specifically, this
involves forming consistent, meaningful
collections of words which match regions of
the speech waveform, and evaluating and
interpreting the possible syntactic
structures built of them. This paper
discusses the data structures and
organization of SPEECHLIS semantics and how
they are directed to the above two tasks.

Introduction

Psychologists  have  demonstrated  that  it 
is  necessary  for  people  to  be  able  to  draw 
upon  syntactic  and  semantic  information  in 
their understanding of speech: the acoustic
signal they hear is so imprecise and
ambiguous that even a knowledge of the
vocabulary is insufficient to insure correct
understanding. For example, Pollack and
Pickett's experiments [3] with fragments of
speech excised from eight-word sentences and
played to an audience showed that 90%
intelligibility was not achieved until a
fragment spanned six of the eight words and
its syntactic and semantic structure were
becoming apparent. Similarly, the apparent
impossibility of building a "phonetic
typewriter" (a machine for taking dictation
and producing English text) or of extending
systems capable of single-word recognition to
ones capable of recognizing continuous speech
seems  to  imply  that  this  ability  to  draw  on 
syntactic  and  semantic  information  is 
necessary  for  computers  too.  Without  making 
any  claims  about  how  a  person  actually 
understands  speech,  this  paper  will  present 
some  kinds  of  semantic  knowledge  and  the  ways 
in  which  they  can  help  a  listener  to 
understand an utterance. An initial attempt
to  organize,  represent  and  use  such  semantic 
knowledge  in  SPEECHLIS  will  also  be 
described. 

Kinds of Semantic Knowledge

Semantic knowledge of the names of
familiar things and of models for forming the
names of new ones permits a listener to
expect and hear words which make sense by
naming things which he knows. For example,
knowing the words "iron" and "oxide", their
meanings, and that a particular oxide (or set
of them) may be specified by putting the name
of  a  metal  before  the  word  "oxide"  can  enable 
a  listener  to  hear  the  sequence  "iron 
oxides",  rather  than  "iron  ox  hides"  or  even 
"Ira  knocks  sides". 

Knowledge of how concepts can be
expressed linguistically enables the listener
who  expects  to  hear  a  particular  concept  to 
tune  himself  for  words  and  phrases  which  can 
realize  it.  For  example,  all  of  the 
following  are  possible  realizations  of  a 
concept  the  listener  might  be  expecting: 


samples  with  no  sodium 
samples  which  do  not  contain  sodium 
samples  in  which  sodium  does  not  occur 
sodium-free  samples 

Knowledge  of  lexical  semantics  (models 
of  how  words  are  used)  enables  the  listener 
to  predict  and  verify  the  possible  surface 
contexts  of  particular  words.  For  example, 
"contain" names a two-place relation. When
it is used in an active sentence, its subject
is to be understood as a location or holder,
and its object, as something capable of being
located or held. In a passive sentence, the
active object becomes the passive subject and
the active subject or location is realized in
a prepositional phrase headed by "in".

Every lunar breccia contains silicon. (Active)
Silicon is contained in every lunar breccia. (Passive)

This  knowledge  also  enables  a  listener  to 
hear  things  which  make  sense  rather  than 
things which don't. For example, the
following  are  two  possible  transcriptions  of 
the  same  utterance.  The  first  is  more 
likely,  since  the  subject  of  the  second, 
though an acceptable noun phrase, cannot be
understood  as  a  location  or  holder. 

Washington's  tin  contains  traces  of 
sulfur;  Oregon's  does  not. 

Washing  tungsten  contains  traces  of 
sulfur;  Oregon's  does  not. 
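
The lexical model of "contain" described above can be sketched as a mapping from surface positions to the two places of the relation, followed by a check that the holder is something capable of holding. The function names, the role labels "holder" and "held", and the toy `HOLDERS` set are assumptions for illustration, not the system's representation.

```python
def interpret_contain(voice, subject, obj=None, in_phrase=None):
    """Map surface positions onto the two places of "contain"."""
    if voice == "active":
        return {"holder": subject, "held": obj}
    if voice == "passive":
        return {"holder": in_phrase, "held": subject}
    raise ValueError("unknown voice: " + voice)


# Toy knowledge of what can be a location or holder (an assumption).
HOLDERS = {"every lunar breccia", "Washington's tin"}


def plausible(reading):
    """True if the holder place is filled by a known location or holder."""
    return reading["holder"] in HOLDERS


active = interpret_contain("active", "every lunar breccia", obj="silicon")
passive = interpret_contain("passive", "silicon", in_phrase="every lunar breccia")
print(active == passive)     # True: both sentences name the same relation

tin = interpret_contain("active", "Washington's tin", obj="traces of sulfur")
tungsten = interpret_contain("active", "Washing tungsten", obj="traces of sulfur")
print(plausible(tin), plausible(tungsten))    # True False
```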

Semantic knowledge of the meanings
conveyable by different syntactic structures
enables the listener to hear cues to
syntactic structure which might otherwise be
lost.  Syntactic  structure  is  often  signalled 
by  very  weak  acoustic  cues  such  as  small 
function  words  like  prepositions  and 
determiners.  The  knowledge  of  what  semantic 
relations  can  meaningfully  hold  between  two 
concepts  can  often  help  in  recovering  the 
syntactic  cues  which  signal  them.  For 
example,  the  preposition  "of"  can  get  lost  in 
an  utterance  of  "analyses  of  ferrous  oxide". 
Yet  the  only  meaningful  relation  between 
"analyses" and "ferrous oxide" that can be
expressed, given this word order, demands
that "ferrous oxide" be realized as a
prepositional phrase headed by "of" or "for":
"analyses ferrous oxide" is meaningless.
This enables a listener to hear the "of"
which might otherwise have been lost.

The SPEECHLIS Environment

In the BBN Speech Understanding System
(SPEECHLIS), an effort is underway to provide
a framework in which the above-mentioned
kinds of semantic knowledge may be
represented and used to produce an
appropriate semantic interpretation of an
input utterance.

Before discussing the current embodiment
of SPEECHLIS semantics in detail, it would be
useful to describe briefly the environment in
which it operates. (For a more detailed
exposition of the SPEECHLIS world, see
references 4 and 6.)

Three higher-level components
comprising the system's knowledge of syntax,
semantics, and pragmatics work to produce a
syntactically sound, semantically meaningful
and pragmatically felicitous reconstruction
of the original utterance. Input to these
components is a Word Lattice which is
produced by acoustic-phonetic, phonological
and lexical retrieval programs from an
analysis of the input utterance. Entries in
the word lattice are words which are found to
be likely matches in regions of the speech
waveform. Because there may be more than one
likely word match in any region, a lattice of
alternative possible word matches results,
rather than a single string. Associated with
each such word match is a description of
where it matched and how well. Initially
only words of three or more phonemes in
length are included, since shorter words tend
to produce possible matches everywhere.

A set of control programs determines the
operational sequence of the other parts of
the system - who does what when - keeping
track of what has already been done and what
is left to do.

The higher-level components work
together to produce Theories. A theory is a
hypothesis that a set of word matches belongs
to an utterance. This set need not span the
utterance, but may only cover parts of it. A
theory contains information about how the
words can fit together syntactically,
semantically and pragmatically, as well as
how confident each component is
in that theory. Various events may happen
during the analysis of a theory which would
tend to change the weight of the theory, to
make SPEECHLIS more or less happy with it.
For example, if no word could be matched just
to the right of a given word match, we would
be less certain about its being in the
original utterance. On the other hand, were
[...] to the right of a word
match for "lunar", we would be more confident
[...]
appropriate Notice when one has occurred.
Examples of semantic monitors and events will
be found further on in this paper.


Organization and Use of
Semantic Knowledge in SPEECHLIS

Organizational  factors 

The sequence of words which lay behind
the utterance for its speaker may not be its
only reading for a listener: "his wheat germ
and honey" could easily be heard as "his
sweet German honey". In order for the
listener to arrive at the same reading of the
utterance as its speaker, he must be able to
use whatever cues he can get from the speech
signal to reconstruct the entire utterance.
Moreover, the precision with which people
speak varies, so that the strong cues - those
which the listener feels he can most
reliably trust - cannot be determined a
priori.

Reconstructing the utterance starting
from one of its parts requires models of
possible utterance constituents and the
ability to access these models starting from
any part. These may be models of
syntactically well-formed constituents - e.g.
noun phrases - as well as models of
semantically meaningful ones. A semantics
system for speech understanding must not be
restricted to accessing its semantic
models in one way because that way may not be
suggested by the available cues. This was
not a strong factor in previous semantic
systems designed for the automatic analysis
of written text where the sequence of words
in the input was known. For example, the
semantic models in the BBN LUNAR system [5],
built for NASA to answer written English
questions about the Apollo 11 lunar samples,
were templates on syntactic tree structures
which specified selectional restrictions on
the leaves, and were classified and
retrievable only by their head noun or verb.
The programs which did the template matching
could not begin to determine which templates
might be applicable, and thus what other
parts of the model to look for, without the
identification of the head noun or verb. For
speech  applications,  semantic  information 
must  be  organized  and  accessible  in  ways  that 
will  enable  the  listener  to  make  use  of  the 
strong  cues  he  did  hear,  even  if  the  head 
noun  or  verb  is  garbled  or  misheard. 

Data  Structures 

The data structures of SPEECHLIS
semantics have been designed to represent the
kinds of semantic knowledge mentioned above
in a way that allows flexible access. The
two principal structures - a semantic network
and case frame tokens - are discussed below.

Figure 1. Small semantic network


The  Semantic  Network.  A  semantic 
network,  a  tiny  piece  of  which  is  shown  in 
Figure 1, represents the associations among
words and concepts. (The names of nodes
which do not correspond to specific English
words are indicated in capital letters.)
There are currently three types of nodes in
the network. The first kind corresponds to
concepts named by single English words like
"ferric", "iron", and "contain".

The  second  type  of  node  corresponds  to 
concepts  like  "fayalitic  olivine"  which  we 
refer  to  as  "multi-word  names".  They  are 
identified  by  the  types  of  arcs  entering  and 
leaving them. Dsubset, dsubstuff, and
dsuperc links say that one concept is a
subset, substuff, or superconcept of another
by  definition.  That  is,  while  both  fayalitic 
olivine  and  peridotite  are  types  of  olivine, 
only  the  former  is  so  by  definition.  A  mod 
link  goes  to  the  node  which  effects  the 
subcategorization  of  the  original  concept. 
(The  mod  link  does  not  specify  how  the 
meaning  of  the  modifier  affects  the  meaning 
of  the  modified  concept.  For  example,  the 
links  between  the  nodes  for  "fayalitic 
olivine" and "fayalitic" and those for
"principal investigator" and "principal" are
both  mod  links,  though  the  modifiers  do  not 
affect  the  nouns  in  the  same  way.) 
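
One way to picture the node and link types just described is as
a graph of labeled arcs. The sketch below is an assumption-laden
illustration: the class name, the arc direction (from the
multi-word name to the concept it subcategorizes) and the rule
that a multi-word name is recognized by an outgoing mod arc are
choices made for the example, not details given in the text.

```python
from collections import defaultdict


class Network:
    """A toy labeled graph: arcs are (source, link type, target) triples."""

    def __init__(self):
        self.out = defaultdict(list)   # node -> [(link, target)]
        self.into = defaultdict(list)  # node -> [(link, source)]

    def add(self, source, link, target):
        self.out[source].append((link, target))
        self.into[target].append((link, source))

    def targets(self, node, link):
        """Nodes reached from `node` along arcs of type `link`."""
        return [t for (l, t) in self.out.get(node, []) if l == link]

    def sources(self, node, link):
        """Nodes with an arc of type `link` entering `node`."""
        return [s for (l, s) in self.into.get(node, []) if l == link]

    def is_multiword_name(self, node):
        # Assumption: a multi-word name is the node a mod arc leaves.
        return bool(self.targets(node, "mod"))


net = Network()
net.add("fayalitic olivine", "dsubset", "olivine")  # subset by definition
net.add("fayalitic olivine", "mod", "fayalitic")    # the modifying word
net.add("ferrous oxide", "dsubset", "oxide")
net.add("ferrous oxide", "mod", "ferrous")
```

Because arcs are stored in both directions, the network can be
entered from either the modifier or the modified concept.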

Semantics  uses  its  knowledge  of 
multi-word  names  to  propose  additional  words 
to  the  word  matcher.  That  is,  given  a  word 
match,  the  rest  of  a  multi-word  name  of  which 
its  word  is  a  part  might  have  occurred  in  the 
original  utterance,  but  be  missing  due  to 
poor  match  quality.  The  effort  the  word 
matcher  spends  here  depends  on  how  necessary 
it  is  for  the  word  match  to  be  part  of  a 
multi-word  name.  For  example,  given  a  word 
match for "oxide", Semantics would ask the
word matcher to look for "ferrous" or
"ferric" to the left of "oxide", naming
"ferrous oxide" or "ferric oxide". Given a
match  for  "ferric"  or  "ferrous",  Semantics 
would  ask  it  to  look  much  harder  for  "oxide", 
since  neither  "ferrous"  nor  "ferric"  could 
appear  in  an  utterance  alone. 
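
The proposal step in this example can be sketched as follows.
The table of names, the set of words assumed never to occur
alone, and the two effort labels are hypothetical stand-ins; the
text says only that the effort depends on how necessary it is
for the match to be part of a multi-word name.

```python
# Hypothetical tables: multi-word names as (modifier, head) pairs, and
# the words assumed unable to appear outside a multi-word name.
NAMES = [("ferrous", "oxide"), ("ferric", "oxide")]
BOUND_WORDS = {"ferrous", "ferric"}


def proposals(word):
    """Return (proposed word, side, effort) requests for the word matcher."""
    # A word that cannot stand alone justifies a harder search.
    effort = "high" if word in BOUND_WORDS else "normal"
    requests = []
    for modifier, head in NAMES:
        if word == head:
            requests.append((modifier, "left", effort))
        elif word == modifier:
            requests.append((head, "right", effort))
    return requests
```

A match for "oxide" yields two ordinary requests to its left,
while a match for "ferric" yields a single high-effort request
for "oxide" to its right.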

The  third  type  of  node  represents  a 
relation  -  a  concept  which  takes  arguments. 
For  example,  the  relation  named  by  "analysis" 
takes  two  arguments  --  an  instantiation  of 
the concept CONSTITUENT, e.g. "iron", and
one of the concept SAMPLE, e.g. "each
breccia".

Semantics  uses  its  knowledge  of  words, 
multi-word  names,  and  relations  to  construct 
theories  for  meaningful  sets  of  word  matches. 
Given a word match, Semantics follows arcs
through the network, looking for multi-word
names and relations of which it or a concept
that it instantiates may be a part. On each
of the other components of the name or
relation, Semantics sets monitors. Should an
event  occur  in  which  a  monitored  component  is 
instantiated  and  both  general  and  specific 
conditions  are  met,  the  monitor  creates  a 
notice  calling  for  the  construction  of  a  new, 
expanded  theory. 

To  see  this,  consider  again  the  network 
shown  in  figure  1  and  a  word  match  for 
"oxide".  Semantics  would  find  that  "oxide" 
is  one  of  the  components  of  "ferrous  oxide" 
and  "ferric  oxide",  and  would  set  monitors  on 
the  nodes  corresponding  to  "ferrous"  and 
"ferric"  with  the  specific  condition  that  any 
match  for  which  a  notice  is  to  be  created 
appear  to  the  immediate  left  of  "oxide". 
Word  matches  which  trigger  these  monitors 
must  also  satisfy  the  general  condition  which 
disallows  overlapping  word  matches. 
Semantics would also find that oxides could
be constituents of rocks and a constituent
could be one argument to the relation named
by "analysis" / "analyses", the other being
the concept SAMPLE. (Note that a node which
can be referred to by the root form of a word
is also referred to by any inflected form.)
The nodes corresponding to "analysis" /
"analyses" and to SAMPLE would both be
monitored.
Subsequently, a non-overlapping word match
for "analysis" or "analyses" or one which
instantiates SAMPLE (e.g. "rock") would be
seen by a monitor and result in the creation
of a notice linking "oxide" with the new word
match.
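
The monitor mechanism of this example can be sketched as below.
The representation of a word match by start and end positions,
the class and function names, and the decision to return notices
as simple pairs are all assumptions for illustration; only the
general condition (no overlap) and the specific condition
(immediate left adjacency) come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class WordMatch:
    word: str
    start: int  # assumed position where the match begins
    end: int    # assumed position where the match ends (exclusive)


@dataclass
class Monitor:
    node: str         # the monitored network node
    owner: WordMatch  # the word match on whose behalf it was set
    specific: Callable[[WordMatch, WordMatch], bool]


def no_overlap(a, b):
    """General condition: the two matches may not share any position."""
    return a.end <= b.start or b.end <= a.start


def immediately_left(new, owner):
    """Specific condition: the new match ends where the owner begins."""
    return new.end == owner.start


def anywhere(new, owner):
    """No specific positional condition."""
    return True


def notices(monitors: List[Monitor], new: WordMatch, node: str):
    """Return (owner, new) pairs for every monitor the new match triggers."""
    return [(m.owner, new) for m in monitors
            if m.node == node
            and no_overlap(new, m.owner)
            and m.specific(new, m.owner)]


oxide = WordMatch("oxide", 10, 15)
monitors = [Monitor("ferrous", oxide, immediately_left),
            Monitor("ferric", oxide, immediately_left),
            Monitor("analysis", oxide, anywhere),
            Monitor("SAMPLE", oxide, anywhere)]
```

A match for "ferrous" ending at position 10 triggers the first
monitor, whereas one ending at position 7 does not, and a match
for "rock" that overlaps "oxide" is rejected by the general
condition.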

Each  notice  has  a  weight  representing 
how  confident  Semantics  is  that  the  resulting 
theory  is  a  correct  hypothesis  about  the 
original utterance. In the above, Semantics
is less certain that a theory for "rock" and
"oxide" will eventually instantiate the
concept ANALYSIS than that a theory for
"analysis" and "oxide" will. The event for the
latter is given a higher weight than that for
the former. (One is more certain that a
particular  relation  has  been  expressed  if  one 
has  heard  its  name  mentioned  rather  than  one 
or  more  of  its  possible  arguments.) 
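
The ordering of weights described here can be sketched with a
small table. The numeric values and the rule of taking the
strongest piece of evidence are purely hypothetical; the text
states only that hearing the name of a relation counts for more
than hearing one of its possible arguments.

```python
# Hypothetical weights; only their relative order is taken from the text.
EVIDENCE_WEIGHT = {"relation_name": 1.0, "argument": 0.5}


def notice_weight(evidence_kinds):
    """Weight a notice by the strongest kind of evidence among its matches."""
    return max(EVIDENCE_WEIGHT[kind] for kind in evidence_kinds)


# "analysis" names the relation and "oxide" is a possible argument.
named = notice_weight(["relation_name", "argument"])
# "rock" and "oxide" are both only possible arguments of ANALYSIS.
unnamed = notice_weight(["argument", "argument"])
```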

Case Frame Tokens. The semantic network
indicates the existence of relationships
between concepts. Case frames [2] on the
other  hand  describe  how  these  relationships 
hold  and  how  the  relation  may  be  expressed  in 
an  utterance.  Associated  with  each  relation 
is  a  case  frame  such  as  the  one  shown  in 
Figure  2  for  ANALYSIS. 

A  case  frame  is  divided  into  two  parts: 
the  first  part  contains  information  relating 
to  the  case  frame  as  a  whole;  the  second, 
descriptive  information  about  the  cases.  (A 
case usually names an argument place in a
relation, but we have extended its use
somewhat to include the relation itself as a
case, specifically the head case (NP-HEAD or
S-HEAD). This allows a place for the
latter's instantiation in an utterance, as
well as the instantiations of each of the
arguments.)

Among the types of information in the
first part is a specification of whether a
surface realization of the case frame will be
parsed as a clause (REALIZES . CLAUSE) or as
a noun phrase (REALIZES . NOUN-PHRASE). If
as a clause, further information specifies
which cases are possible active clause
subjects (ACTIVSUBJ's) and which are possible
passive clause subjects (OTHERSUBJ's).

(While  not  usual,  there  are  verbs  like 
"break"  which  allow  several  possible  cases  to 
become  the  active  subject,  but  the  order  in 
which  they  are  chosen  is  determined  a  priori 
by  which  cases  are  actually  present.  Thus, 
the cases in ACTIVSUBJ are ordered, given the
presence or absence of each case. However,
there is no preferred order in selecting
which case becomes passive subject, so the
case names on OTHERSUBJ are not.) The first
part of the case frame may also contain other
information such as inter-case restrictions
as would apply between instantiations of the
object and goal cases of RATIO - that they be
measurable in the same units.
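
The use of an ordered ACTIVSUBJ list can be sketched as a simple
first-match selection. The particular order shown (agent, then
instrument, then object) and the case name S-INST are
assumptions for illustration; the text says only that the list
is ordered and that the choice depends on which cases are
present.

```python
# Assumed preference order for a verb like "break"; the case names
# other than the S- prefix convention are hypothetical.
ACTIVSUBJ = ["S-AGT", "S-INST", "S-OBJ"]


def active_subject(present_cases, order=ACTIVSUBJ):
    """Return the first case in the ordered list that is actually present."""
    for case in order:
        if case in present_cases:
            return case
    return None
```

Under this assumed order, an utterance with an agent and an
object takes the agent as active subject, while one with only
an object takes the object.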


The second part of the case frame
contains descriptive information about each
case in the frame:

a) its name, e.g. NP-OBJ, S-HEAD. (The first
   part of the name gives redundant
   information about the frame's syntactic
   realization: "NP" for noun phrase and "S"
   for clause. The second part is a
   Fillmore-type [2] case name or an
   abbreviation of one: "OBJ" for object,
   "AGT" for agent, "GOAL" for goal, etc.)

b) the way it can be filled - whether by a
   synonym for a concept (EQU) or by an
   instantiation of it (MEM), e.g. (EQU .
   SAMPLE) would permit "sample" or "lunar
   sample" to fill the case, but not
   "breccia", which refers to a subset of the
   samples.

c) a list of prepositions which could signal
   the case were it realized as a
   prepositional phrase (PP). If the case is
   not realizable as a PP, this entry will be
   NIL.

d) an indication of whether the case must be
   explicitly specified (OBL), whether it is
   optional and unnecessary (OPT), or
   whether, when absent, it will be derivable
   from context (ELLIP). For example, in
   "The bullet hit.", the object case - what
   was hit - will be derivable from context.
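
The four items of case information listed above can be sketched
as a small record with a fill check. The word tables, the class
and function names, and the reading of MEM as accepting only
instances (not synonyms) are assumptions made for the example
and are not taken from SPEECHLIS itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical toy knowledge: words that name a concept, and words
# that name members or subsets of it.
SYNONYMS = {"SAMPLE": {"sample", "lunar sample"}}
INSTANCES = {"SAMPLE": {"breccia", "rock"}}


@dataclass(frozen=True)
class Case:
    name: str     # item (a), e.g. "NP-LOC"
    mode: str     # item (b): "EQU" or "MEM"
    concept: str  # the concept a filler is checked against
    prepositions: Optional[Tuple[str, ...]]  # item (c); None stands for NIL
    status: str   # item (d): "OBL", "OPT" or "ELLIP"


def can_fill(case, word, preposition=None):
    """Check the fill mode and, if given, that the preposition signals the case."""
    table = SYNONYMS if case.mode == "EQU" else INSTANCES
    if word not in table.get(case.concept, set()):
        return False
    if preposition is None:
        return True
    return case.prepositions is not None and preposition in case.prepositions


def missing_obligatory(cases, filled_names):
    """Names of OBL cases that no word match has filled."""
    return [c.name for c in cases
            if c.status == "OBL" and c.name not in filled_names]


np_loc = Case("NP-LOC", "MEM", "SAMPLE", ("in", "for", "of", "on"), "ELLIP")
equ_case = Case("NP-HEAD", "EQU", "SAMPLE", None, "OBL")  # illustrative only
```

Here "sample" can fill the EQU case but "breccia" cannot, while
"breccia" introduced by "in" can fill the MEM case NP-LOC.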

Tasks 

Two  tasks  of  SPEECHLIS  Semantics  have 
already  been  mentioned  in  the  section  on  data 
structures:  to  propose  additional  words  which 
might have occurred in the original utterance
but were missing from the initial word
lattice because of poor match quality, and to
construct meaningful sets of word matches
from a lattice of possible ones. A third
task  of  Semantics  is  to  evaluate  the 
consistency  of  syntactic  structures  and 
semantic  hypotheses. 

Semantic  Evaluation.  As  more  word 
matches are included in a theory, Semantics
represents its hypotheses about their
semantic structure in case frame tokens.
These are case frames which have been
modified to show which word match or other
case frame token fills each instantiated
case.

The two case frame tokens in figure 3
represent semantic hypotheses about how the
word matches for "analyses", "ferrous" and
"oxide" fit together. "Analyses" is the head
(NP-HEAD) of a case frame token whose goal
case (NP-GOAL) is filled by another case
frame token representing "ferrous oxide".
Another  way  of  showing  this  is  in  the  tree 
format  of  figure  4.  There  are  a  small  number 
of  syntactic  structures  that  each  possible 
set  of  cases  can  be  realized  as:  here,  the 
head  case  must  correspond  to  the  syntactic 
head  and  the  goal  case  must  be  realized  as 
either  a  prepositional  phrase  or  adjectival 
modifier  on  the  head.  Thus,  in  figure  5, 
syntactic  structures  (a)  and  (b)  would 
confirm  the  semantic  hypotheses  in  figure  3, 
while  (c),  where  "analyses"  modifies  "oxide", 
would not. Notice that the only difference
between the terminal strings of (a) and (c)
is  the  presence  of  the  preposition  "of". 
Yet,  this  small  word  makes  the  difference 
between  an  acceptable  syntactic  structure  and 
an  unacceptable  one. 
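
The confirmation test described in this paragraph can be
sketched as a check on a simplified syntactic structure. The
dictionary encoding and the three sample structures are
assumptions standing in for the trees of figure 5; in
particular, structure (b) is taken here to be the reading in
which "ferrous oxide" is an adjectival modifier of "analyses".

```python
def confirms(structure, head_word, goal_word):
    """True if the head case is the syntactic head and the goal case is
    realized as a prepositional phrase or adjectival modifier on it."""
    if structure["head"] != head_word:
        return False
    return any(kind in ("pp", "adj") and word == goal_word
               for kind, word in structure["modifiers"])


# Assumed encodings of the three structures discussed in the text.
a = {"head": "analyses", "modifiers": [("pp", "ferrous oxide")]}
b = {"head": "analyses", "modifiers": [("adj", "ferrous oxide")]}
c = {"head": "oxide", "modifiers": [("adj", "analyses"), ("adj", "ferrous")]}
```

Structures (a) and (b) pass the check for head "analyses" and
goal "ferrous oxide"; structure (c) fails because "oxide", not
"analyses", is its syntactic head.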

As  the  syntactic  component  of  SPEECHLIS 
(see [1]) builds structures, Semantics
evaluates them against its hypotheses and
assigns a score to them which depends on how
many of its hypotheses are fulfilled and how
much material on the syntactic tree violates
or is not part of Semantics' hypotheses.
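
A score of this kind might be sketched as follows. The linear
form, the penalty value and the representation of hypotheses
and tree material as (head, case, filler) triples are all
assumptions; the text gives only the two quantities on which
the score depends.

```python
def score(hypotheses, tree_relations, penalty=1.0):
    """Reward fulfilled hypotheses; penalize tree material outside them.

    Both arguments are sets of (head, case, filler) triples. The linear
    form and the penalty are assumptions, not the SPEECHLIS formula.
    """
    fulfilled = len(hypotheses & tree_relations)
    extraneous = len(tree_relations - hypotheses)
    return fulfilled - penalty * extraneous


hypotheses = {("analyses", "NP-GOAL", "oxide"),
              ("oxide", "NP-MOD", "ferrous")}
tree_a = {("analyses", "NP-GOAL", "oxide"),
          ("oxide", "NP-MOD", "ferrous")}
tree_c = {("oxide", "NP-MOD", "analyses"),
          ("oxide", "NP-MOD", "ferrous")}
```

Under these assumptions a tree that realizes both hypotheses
scores higher than one in which "analyses" modifies "oxide".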

Other Tasks

Semantics has two other tasks in
SPEECHLIS  whose  implementations  are  not  far 
enough  along  to  describe  in  detail.  First, 
Semantics  should  guide  Syntax  to  the  most 
meaningful  parse  as  directly  as  possible. 
That is, Syntax should not make random
choices in places where Semantics has
information that can be used to order the
choices. This will be implemented via
Syntax's ability to ask questions of
Semantics on the arcs of the Transition
Network Grammar [1], and will eliminate the
need to wait until a well-formed substring
with syntactic structure is created before
getting a measure of meaningfulness from
Semantics.

Finally,  Semantics  should  transform  the 
best  theory  (or  theories)  about  an  utterance 
into  a  formal  procedure  for  operating  on  its 
data  base  in  order  to  answer  questions  or  to 
absorb  new  information.  This  is  where 
"speech understanding" differs from "speech
recognition". SPEECHLIS will not have to
distinguish among best theories which mean
the same thing (i.e. are mapped into the
same formal procedure), though differing in
the  exact  words  they  contain.  Many  of  the 
interpretation  methods  that  we  used  in  the 
LUNAR  system  we  expect  to  carry  over  into  the 
speech  world. 


CASE  FRAME  FOR  ANALYSIS 


(((Realizes . Noun-Phrase))
 (Np-Head (Equ . 14) Nil Obl)
 (Np-Goal (Mem . 1) (Of For) Ellip)
 (Np-Loc (Mem . 7) (In For Of On) Ellip))


Concept 14 - Concept of Analysis

Concept 1 - Concept of Component

Concept 7 - Concept of Sample


figure  2 


Current State of SPEECHLIS Semantics

Based  on  a  vocabulary  of  approximately 
175  content  words  on  lunar  geology  and  the 
names of the 43 Apollo 11 samples, a semantic
network  has  been  constructed,  containing 
approximately  350  nodes.  (The  other  75  words 
in  the  SPEECHLIS  vocabulary  are  function 
words  -  determiners,  prepositions, 
auxiliaries,  and  conjunctions  -  whose 
meanings  do  not  seem  to  be  the  types  of 
things the current network can represent.) We
have  run  the  higher  level  components  on  only 
a  small  number  of  word  lattices,  so  the 
following  results  are  only  preliminary 
impressions.  In  analyzing  an  utterance,  each 
new  theory  seems  to  set,  on  the  average,  5-6 
monitors  on  nodes  of  the  network.  This  is 
not  so  extraordinarily  low,  as  a  theory  for  a 
verb  word  match  will  only  set  monitors  on  the 
arguments  to  the  relation  it  names,  and  the 
number  of  arguments  to  any  relation  rarely 
exceeds  three.  On  the  average,  4-5  event 
notices will be created during the processing
of  each  theory.  Many  of  these  events  are 
very  unlikely,  and  experiments  with  pruning 
strategies - not creating unlikely events -
seem  to  show  good  results,  with,  on  the 
average,  one  notice  being  built  per  theory. 

Many  problems  remain  to  be  solved:  for 
example,  SPEECHLIS  Semantics  will  be  extended 
to larger vocabularies and larger semantic
networks.  It  is  not  clear,  for  example,  by 
how  much  the  network  would  grow,  were  the 
vocabulary  size  to  double,  triple,  or  even 
quadruple,  or  were  we  to  want  to  use  the 
network  for  other  tasks  such  as  inference 
making.  But  we  take  for  granted  now  the 
important  role  that  Semantics  must  play  in 
automatic  speech  understanding,  so  these  and 
many  other  problems  will  have  to  be  faced. 


CASE  FRAME  TOKENS 


[Cft #6
 (((Realizes . Noun-Phrase))
  (Np-Head (Analyses . 14) Nil Obl)
  (Np-Goal (Cft #5 . 1) (Of For) Ellip)
  (Np-Loc (Mem . 7) (In For Of On) Ellip))]


[Cft #5
 (((Realizes . Noun-Phrase)
   (Case of Cft #6))
  (Np-Mod (Ferrous . 13) Nil Obl)
  (Np-Head (Oxide . 5) Nil Obl))]


figure  3 

[figures 4 and 5: tree diagrams; surviving labels: Np-Loc, Np-Mod, (a), (b), (c), N, Ferrous, Oxide, Analyses]